Thou must bear the sorrow that thou claimst to heal;
The day-bringer must walk in darkest night. 
He who would save the world must share its pain. 
If he knows not grief, how shall he find grief's cure? 

Savitri — Sri Aurobindo 



In loving memory of my father and grandmother 
A Third Need in Engineering Education

At its inception, engineering education was predominantly process oriented, while engineering 
practice tended to be predominantly system oriented¹. While it was invaluable to have a strong
fundamental knowledge of the processes, educators realized the need to have courses where 
this knowledge translated into an ability to design systems; therefore, most universities, starting 
in the 1970s, mandated that seniors take at least one design/capstone course. However, a third 
aspect is acquiring increasing importance: the need to analyze, interpret and model data. Such
a skill set is proving to be crucial in all scientific activities, none more so than in engineering
and the physical sciences. How can data collected from a piece of equipment be used to assess 
the claims of the manufacturers? How can performance data either from a natural system or a 
man-made system be respectively used to maintain it more sustainably or to operate it more 
efficiently? Such needs are driven by the fact that system performance data is easily available 
in our present-day digital age where sensor and data acquisition systems have become reliable, 
cheap and part of the system design itself. This applies both to experimental data (gathered 
from experiments performed according to some predetermined strategy) and to observational 
data (where one can neither intrude on system functioning nor have the ability to control the 
experiment, such as in astronomy). Techniques for data analysis also differ depending on the 
size of the data; smaller data sets may require the use of "prior" knowledge of how the system is 
expected to behave or how similar systems have been known to behave in the past. 

Let us consider a specific instance of observational data: once a system is designed and 
built, how to evaluate its condition in terms of design intent and, if possible, operate it in an 
"optimal" manner under variable operating conditions (say, based on cost, or on minimal envi- 
ronmental impact such as carbon footprint, or any appropriate pre-specified objective). Thus, 
data analysis and data driven modeling methods as applied to this instance can be meant to 
achieve certain practical ends — for example: 

(a) verifying stated claims of manufacturer; 

(b) product improvement or product characterization from performance data of prototype; 

(c) health monitoring of a system, i.e., how does one use quantitative approaches to reach 
sound decisions on the state or "health" of the system based on its monitored data? 

(d) controlling a system, i.e., how best to operate and control it on a day-to-day basis? 

(e) identifying measures to improve system performance, and assessing the impact of these measures;

(f) verification of the performance of implemented measures, i.e., are the remedial measures 
implemented impacting system performance as intended? 



¹ Stoecker, W.F., 1989. Design of Thermal Systems, 3rd Edition, McGraw-Hill, New York.
Intent

Data analysis and modeling is not an end in itself; it is a well-proven and often indispensable
aid for subsequent decision-making, allowing realistic assessments and predictions to be made
about whether the system behaves as expected, about its current operational state,
and/or about the impact of any intended structural or operational changes. It has its roots in statistics,
probability, regression, mathematics (linear algebra, differential equations, numerical
methods, ...), modeling and decision making. Engineering and science graduates are somewhat
comfortable with mathematics while they do not usually get any exposure to decision analysis 
at all. Statistics, probability and regression analysis are usually squeezed into a sophomore 
term resulting in them remaining "a shadowy mathematical nightmare, and ... a weakness 
forever"² even to academically good graduates. Further, many of these concepts, tools and
procedures are taught as disparate courses not only in physical sciences and engineering but 
in life sciences, statistics and econometric departments. This has led to many in the physical 
sciences and engineering communities having a pervasive "mental block" or apprehensiveness 
or lack of appreciation of this discipline altogether. Though these analysis skills can be learnt 
over several years by some (while some never learn it well enough to be comfortable even after 
several years of practice), what is needed is a textbook which provides: 

1. a review of classical statistics and probability concepts,

2. a basic and unified perspective of the various techniques of data-based mathematical modeling
and analysis,

3. an understanding of the "process" along with the tools,

4. a proper combination of classical methods with the more recent machine learning and automated
tools which the widespread use of computers has spawned, and

5. well-conceived examples and problems involving real-world data that would illustrate these
concepts within the purview of specific areas of application.

Such a text is likely to dispel the current sense of unease and provide readers with the neces- 
sary measure of practical understanding and confidence in being able to interpret their num- 
bers rather than merely generating them. This would also have the added benefit of advancing 
the current state of knowledge and practice in that the professional and research community 
would better appreciate, absorb and even contribute to the numerous research publications in 
this area. 



Approach and Scope 

Forward models needed for system simulation and design have been addressed in numerous 
textbooks and have been well-inculcated into the undergraduate engineering and science cur- 
riculum for several decades. It is the issue of data-driven methods, which I feel is inadequately 
reinforced in undergraduate and first-year graduate curricula, and hence the basic rationale for 
this book. Further, this book is not meant to be a monograph or a compilation of published
papers, i.e., it is not a literature review. It is meant to serve as a textbook for senior undergraduate
or first-year graduate students or for continuing education professional courses, as well as a 
self-study reference book for working professionals with adequate background. 



² Keller, D.K., 2006. The Tao of Statistics, Sage Publications, London, U.K.
Applied statistics and data based analysis methods find applications in various engineering,
business, medical, and physical, natural and social sciences. Though the basic concepts are the 
same, the diversity in these disciplines results in rather different focus and differing emphasis 
of the analysis methods. This diversity may be in the process itself, in the type and quantity 
of data, and in the intended purpose of the analysis. For example, many engineering systems
have low "epistemic" uncertainty (i.e., uncertainty associated with the process itself) and also
allow easy gathering of adequate performance data. Such systems are typically characterized
by strong relationships between variables which can be formulated in mechanistic terms, so that
accurate models can be identified. This is in stark contrast to such fields as economics
and social sciences where even qualitative causal behavior is often speculative, and the data
are limited in quantity and high in uncertainty. In fact, even different types of engineered and natural
systems require widely different analysis tools. For example, electrical and specific mechanical
engineering disciplines (e.g., those involving rotary equipment) largely rely on frequency domain
analysis methods, while time-domain methods are more suitable for most thermal and environ- 
mental systems. This consideration has led me to limit the scope of the analysis techniques 
described in this book to thermal, energy-related, environmental and industrial systems. 

There are those students for whom a mathematical treatment and justification helps in better 
comprehension of the underlying concepts. However, my personal experience has been that the 
great majority of engineers do not fall in this category, and hence a more pragmatic approach 
is adopted. I am not particularly concerned with proofs, deductions and statistical rigor which 
tend to overwhelm the average engineering student. The intent is, rather, to impart a broad con- 
ceptual and theoretical understanding as well as a solid working familiarity (by means of case 
studies) of the various facets of data-driven modeling and analysis as applied to thermal and 
environmental systems. On the other hand, this is not a cookbook nor meant to be a reference 
book listing various models of the numerous equipment and systems which comprise thermal 
systems, but rather stresses underlying scientific, engineering, statistical and analysis concepts. 
It should not be considered as a substitute for specialized books nor should their importance be 
trivialized. A good general professional needs to be familiar, if not proficient, with a number 
of different analysis tools and how they "map" with each other, so that he can select the most 
appropriate tools for the occasion. Though nothing can replace hands-on experience in design 
and data analysis, being familiar with the appropriate theoretical concepts would not only shor- 
ten modeling and analysis time but also enable better engineering analysis to be performed. 
Further, those who have gone through this book will gain the required basic understanding 
to tackle the more advanced topics dealt with in the literature at large, and hence, elevate the 
profession as a whole. This book has been written with a certain amount of zeal in the hope 
that this will give this field some impetus and lead to its gradual emergence as an identifiable 
and important discipline (just as that enjoyed by a course on modeling, simulation and design 
of systems) and would ultimately be a required senior-level course or first-year graduate course 
in most engineering and science curricula. 

This book has been intentionally structured so that the same topics (namely, statistics, para- 
meter estimation and data collection) are treated first from a "basic" level, primarily by revie- 
wing the essentials, and then from an "intermediate" level. This would allow the book to have 
broader appeal, and allow a gentler absorption of the needed material by certain students and 
practicing professionals. As pointed out by Asimov³, the Greeks demonstrated that abstraction



³ Asimov, I., 1966. Understanding Physics: Light, Magnetism and Electricity, Walker Publications.
(or simplification) in physics allowed a simple and generalized mathematical structure to be
formulated which led to greater understanding than would otherwise have been possible, along with the ability to
subsequently restore some of the real-world complicating factors which were ignored earlier.
Most textbooks implicitly follow this premise by presenting "simplistic" illustrative examples 
and problems. I strongly believe that a book on data analysis should also expose the student 
to the "messiness" present in real-world data. To that end, examples and problems which deal 
with case studies involving actual (either raw or marginally cleaned) data have been included. 
The hope is that this would provide the student with the necessary training and confidence to 
tackle real-world analysis situations. 



Assumed Background of Reader 

This is a book written for two sets of audiences: a basic treatment meant for the general engineering
and science senior as well as the general practicing engineer on one hand, and an intermediate
treatment meant for the general graduate student and the more advanced professional entering the fields of thermal and
environmental sciences on the other. The exponential expansion of scientific and engineering knowledge
as well as its cross-fertilization with allied emerging fields such as computer science, nano- 
technology and bio-engineering have created the need for a major reevaluation of the thermal 
science undergraduate and graduate engineering curricula. The relatively few professional and
free-elective academic slots available to students require that traditional subject matter be
combined into fewer classes whereby the associated loss in depth and rigor is compensated for 
by a better understanding of the connections among different topics within a given discipline 
as well as between traditional and newer ones. 

It is presumed that the reader has the necessary academic background (at the undergraduate 
level) of traditional topics such as physics, mathematics (linear algebra and calculus), fluids, 
thermodynamics and heat transfer, as well as some exposure to experimental methods, proba- 
bility, statistics and regression analysis (taught in lab courses at the freshman or sophomore 
level). Further, it is assumed that the reader has some basic familiarity with important energy 
and environmental issues facing society today. However, special effort has been made to pro- 
vide pertinent review of such material so as to make this into a sufficiently self-contained 
book. 

Most students and professionals are familiar with the uses and capabilities of the ubiquitous 
spreadsheet program. Though many of the problems can be solved with the existing (or add-on)
capabilities of such spreadsheet programs, it is urged that the instructor or reader select
an appropriate statistical program to do the statistical computing work because of the added
sophistication which it provides. This book does not delve into how to use these programs;
rather, its focus is education-based, intended to provide the knowledge and skill sets
necessary for value, judgment and confidence on how to use them, as against training-based,
whose focus would be to teach facts and specialized software.
1 Mathematical Models and Data Analysis



This chapter starts by introducing the benefits of applied data 
analysis and modeling methods through a case study exam- 
ple pertinent to energy use in buildings. Next, it reviews fun- 
damental notions of mathematical models, illustrates them in 
terms of sensor response, and differentiates between forward 
or simulation models and inverse models. Subsequently, va- 
rious issues pertinent to data analysis and associated uncer- 
tainty are described, and the different analysis tools which 
fall within its purview are discussed. Basic concepts relating 
to white-box, black-box and grey-box models are then pre- 
sented. An attempt is made to identify the different types of 
problems one faces with forward modeling as distinct from 
inverse modeling and analysis. Notions germane to the disci- 
plines of decision analysis, data mining and intelligent data 
analysis are also covered. Finally, the various topics covered 
in each chapter of this book are described. 



1.1 Introduction 

Applied data analysis and modeling of system performance 
is historically older than simulation modeling. The ancients, 
starting as far back as 7000 years ago, observed the move- 
ments of the sun, moon and stars in order to predict their 
behavior and initiate certain tasks such as planting crops 
or readying for winter. Theirs was a necessity impelled by 
survival; surprisingly, still relevant today. The threat of cli- 
mate change and its dire consequences are being studied by 
scientists using in essence similar types of analysis tools — 
tools that involve measured data to refine and calibrate their 
models, extrapolating and evaluating the effect of different 
scenarios and mitigation measures. These tools fall under the 
general purview of data analysis and modeling methods, and 
it would be expedient to illustrate their potential and useful- 
ness with a case study application which the reader can relate 
to more practically. 

One of the current major societal problems facing man- 
kind is the issue of energy, not only due to the gradual de- 
pletion of fossil fuels but also due to the adverse climatic 



and health effects which their burning creates. In 2005,
total worldwide energy consumption was about 500 Exajoules
(= 500 × 10^18 J), which is equivalent to about 16 TW
(= 16 × 10^12 W). The annual growth rate was about 2%, which,
if sustained, suggests a doubling time of 35 years. The United
States (U.S.) accounts for 23% of the world-wide energy use 
(with only 5% of the world's population!), while the build- 
ing sector alone (residential plus commercial buildings) in 
the U.S. consumes about 40% of the total energy use, close 
to 70% of the electricity generated, and is responsible for 
49% of the SOx and 35% of the CO2 emitted. Improvement
in energy efficiency in all sectors of the economy has been 
rightly identified as a major and pressing need, and aggressi- 
ve programs and measures are being implemented worldwi- 
de. It has been estimated that industrial countries are likely 
to see 25-35% in energy efficiency gains over the next 20 
years, and more than 40% in developing countries (Jochem 
2000). Hence, energy efficiency improvement in buildings 
is a logical choice for priority action. This can be achieved
both by encouraging low energy building designs and
by operating existing buildings more energy efficiently. According to
the 2003 Commercial Buildings Energy Consumption Survey (CBECS)
study by the U.S. Department of Energy (USDOE), over 85% of
the building stock (excluding malls) was built before 1990. 
Further, according to the USDOE 2008 Building Energy Data
Book, the U.S. spends $785 billion (6.1% of GDP) on new
construction and $483 billion (3.3% of GDP) on improvements
and repairs of existing buildings. A study of 60 commercial
buildings in the U.S. found that half of them had
control problems and about 40% had problems with the hea- 
ting and cooling equipment (PECI 1997). This seems to be 
the norm. Enhanced commissioning processes in commerci- 
al/institutional buildings which do not compromise occupant 
comfort are being aggressively developed which have been 
shown to reduce energy costs by over 20% and in several 
cases over 50% (Claridge and Liu 2001). Further, existing 
techniques and technologies in energy efficiency retrofitting 
can reduce home energy use by up to 40% per home and 
lower associated greenhouse gas emissions by up to 160 
million metric tons annually by the year 2020. Identifying
energy conservation opportunities, verifying by monitoring
whether anticipated benefits are in fact realized when such
measures are implemented, and operating buildings optimally:
all these tasks require skills in data analysis and modeling.
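
The two conversions quoted above (500 EJ per year to an average power of about 16 TW, and 2% annual growth to a 35-year doubling time) can be checked with a few lines of arithmetic. The following is a minimal sketch; the variable names are illustrative, and the year length of 365.25 days is an assumption.

```python
import math

# Assumed year length (365.25 days) expressed in seconds.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# 500 EJ per year, the 2005 worldwide figure quoted in the text.
annual_energy_joules = 500e18

# Average power is energy divided by the time over which it is used.
average_power_watts = annual_energy_joules / SECONDS_PER_YEAR
average_power_terawatts = average_power_watts / 1e12

# Doubling time under compound growth: (1 + r)^n = 2, so n = ln 2 / ln(1 + r).
growth_rate = 0.02
doubling_time_years = math.log(2) / math.log(1 + growth_rate)

print(f"Average power: {average_power_terawatts:.1f} TW")
print(f"Doubling time: {doubling_time_years:.1f} years")
```

The computed average power is slightly below 16 TW and the doubling time is almost exactly 35 years, consistent with the rounded values in the text.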

Building energy simulation models (or forward models) 
are mechanistic (i.e., based on a mathematical formulation 
of the physical behavior) and deterministic (i.e. where there 
is no randomness in the inputs or outputs)¹. They require as
inputs the hourly climatic data of the selected location, the 
layout, orientation and physical description of the building 
(such as wall material, thickness, glazing type and fraction, 
type of shading overhangs,...), the type of mechanical and 
electrical systems available inside the building in terms of 
air distribution strategy, performance specifications of pri- 
mary equipment (chillers, boilers,...), and the hourly opera- 
ting and occupant schedules of the building. The simulation 
predicts hourly energy use during the entire year from which 
monthly total energy use and peak use along with utility rates 
provide an estimate of the operating cost of the building. The 
primary benefit of such a forward simulation model is that 
it is based on sound engineering principles usually taught 
in colleges and universities, and consequently has gained 
widespread acceptance by the design and professional community.
Major public domain simulation codes (for example,
Energy Plus 2009) have been developed with hundreds of
man-years invested in their development by very competent 
professionals. This modeling approach is generally useful 
for design purposes where different design options are to be 
evaluated before the actual system is built. 

Data analysis and modeling methods, on the other hand, 
are used when performance data of the system is available, 
and one uses this data for certain specific purposes, such as 
predicting or controlling the behavior of the system under 
different operating conditions, or for identifying energy con- 
servation opportunities, or for verifying the effect of energy 
conservation measures and commissioning practices once 
implemented, or even to verify that the system is performing 
as intended (called condition monitoring). Consider the case 
of an existing building whose energy consumption is known 
(either utility bill data or monitored data). Some of the relevant
questions to which a building professional may apply data
analysis methods are:

(a) Commissioning tests: How can one evaluate whether a 
component or a system is installed and commissioned 
properly? 

(b) Comparison with design intent: How does the con- 
sumption compare with design predictions? In case of 
discrepancies, are they due to anomalous weather, to 
unintended building operation, to improper operation 
or to other causes? 



¹ These terms will be described more fully in Sect. 1.2.3.



(c) Demand Side Management (DSM): How would the 
consumption reduce if certain operational changes are 
made, such as lowering thermostat settings, ventilation 
rates or indoor lighting levels? 

(d) Operation and maintenance (O&M): How much energy 
could be saved by retrofits to building shell, changes to 
air handler operation from constant air volume to va- 
riable air volume operation, or due to changes in the va- 
rious control settings, or due to replacing the old chiller 
with a new and more energy efficient one? 

(e) Monitoring and verification (M&V): If the retrofits 
are implemented to the system, can one verify that the 
savings are due to the retrofit, and not to other causes, 
e.g. the weather or changes in building occupancy? 

(f) Automated fault detection, diagnosis and evaluation 
(AFDDE): How can one automatically detect faults in 
heating, ventilating, air-conditioning and refrigerating 
(HVAC&R) equipment which reduce operating life and/ 
or increase energy use? What are the financial implica- 
tions of this degradation? Should this fault be rectified 
immediately or at a later time? What specific measures 
need to be taken? 

(g) Optimal operation: How can one characterize HVAC&R 
equipment (such as chillers, boilers, fans, pumps,...) in 
their installed state and optimize the control and operation 
of the entire system? 
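
As an illustration of question (e), the sketch below shows one simple way such a verification could be set up: fit a baseline model of energy use against outdoor temperature from pre-retrofit data, then compare post-retrofit measurements with what the baseline predicts for the post-retrofit weather. This is only a sketch under stated assumptions: the straight-line model, the function name fit_line and all numbers are hypothetical and synthetic, and they are not taken from any study cited in this chapter.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Synthetic pre-retrofit data (hypothetical): monthly mean outdoor
# temperature in deg C and monthly cooling energy use in MWh.
pre_temp = [10.0, 15.0, 20.0, 25.0, 30.0]
pre_energy = [150.0, 175.0, 200.0, 225.0, 250.0]
intercept, slope = fit_line(pre_temp, pre_energy)

# Synthetic post-retrofit data (hypothetical) under different weather.
post_temp = [12.0, 18.0, 28.0]
post_energy = [128.0, 152.0, 192.0]

# Baseline prediction: what the building would have used, without the
# retrofit, under the weather actually observed after the retrofit.
baseline = [intercept + slope * t for t in post_temp]
savings = sum(b - m for b, m in zip(baseline, post_energy))

print(f"Baseline model: E = {intercept:.1f} + {slope:.2f} * T")
print(f"Estimated savings over the post-retrofit period: {savings:.1f} MWh")
```

Because the comparison is made against the baseline evaluated at the post-retrofit temperatures, the estimated savings are adjusted for weather; other influences, such as changes in occupancy, would need additional variables in the model.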

All the above questions are better addressed by data ana- 
lysis methods. The forward approach could also be used, by 
say, (i) going back to the blueprints of the building and of 
the HVAC system, and repeating the analysis performed at 
the design stage while using actual building schedules and 
operating modes, and (ii) performing a calibration or tuning 
of the simulation model (i.e., varying the inputs in some 
fashion) since simulated performance is unlikely to match observed
performance. This process is, however, tedious and much
effort has been invested by the building professional commu- 
nity in this regard with only limited success (Reddy 2006). 
A critical limitation of the calibrated simulation approach is 
that the data being used to tune the forward simulation mo- 
del must meet certain criteria, and even then, all the nume- 
rous inputs required by the forward simulation model cannot 
be mathematically identified (this is referred to as an over- 
parameterized problem). Though awkward, labor intensive 
and not entirely satisfactory in its current state of development, 
the calibrated building energy simulation model is still an at- 
tractive option and has its place in the toolkit of data analysis 
methods (discussed at length in Sect. 11.2). The fundamental 
difficulty is that there is no general and widely-used model 
or software for dealing with data driven applications as they 
apply to building energy, though specialized software has
been developed which allows certain types of narrow analysis
to be performed. In fact, given the wide diversity in applications
of data driven models, it is unlikely that any one methodology
or software program will ever suffice. This leads to
the basic premise of this book that there exists a crucial need
for building energy professionals to be familiar with and competent
in data analysis methods and tools so that they can
select the one which best meets their purpose, with the end
result that buildings will be operated and managed in a much
more energy efficient manner than currently.
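
The over-parameterization mentioned above can be made concrete with a deliberately simple, hypothetical example that is not taken from the calibration literature: steady heat flow through a wall depends on conductivity and thickness only through their ratio, so measured heat flows cannot identify the two inputs separately. The function name and the numerical values below are illustrative assumptions.

```python
def wall_heat_flow(conductivity, thickness, area, delta_t):
    """Steady conduction through a plane wall: Q = (k / L) * A * dT, in W."""
    return conductivity / thickness * area * delta_t


# Indoor-outdoor temperature differences (K) at which heat flow is "measured".
delta_ts = [5.0, 10.0, 15.0, 20.0]
area = 10.0  # wall area in m^2 (hypothetical)

# Two different parameter sets that share the same ratio k / L = 0.4 W/(m^2 K).
flows_thin = [wall_heat_flow(0.04, 0.10, area, dt) for dt in delta_ts]
flows_thick = [wall_heat_flow(0.08, 0.20, area, dt) for dt in delta_ts]

# The predictions agree to within rounding, so no amount of heat-flow data
# can distinguish the two walls; only the ratio k / L is identifiable.
largest_gap = max(abs(a - b) for a, b in zip(flows_thin, flows_thick))
print(f"Largest difference between the two predictions: {largest_gap:.2e} W")
```

A detailed simulation model contains many such groups of inputs, which is one reason why tuning all of them against measured data is not mathematically possible.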

Building design simulation tools have played a significant 
role in lowering energy use in buildings. These are neces- 
sary tools and their importance should not be understated. 
Historically, most of the business revenue in Architectural 
Engineering and HVAC&R firms was generated from design/build
contracts which required extensive use of simulation
programs. Hence, the professional community is fairly
knowledgeable in this area, and several universities teach
classes geared towards the use of simulation programs.
However, there is an increasing market potential in building 
energy services as evidenced by the number of firms which 
offer services in this area. The acquisition of the required un- 
derstanding, skills and tools relevant to this aspect is different 
from those required during the building design phase. There 
are other market forces which are also at play. The recent in- 
terest in "green" and "sustainable" has resulted in a plethora 
of products and practices aggressively marketed by numerous 
companies. Often, claims that one product can save much
more energy than another, or that one device is more environmentally
friendly than others, prove unfounded
under closer scrutiny. Unbiased evaluations
and independent verification are therefore imperative; otherwise the
whole "green" movement may degrade into mere "greenwashing"
and a feel-good attitude as against partially overcoming
a dire societal challenge. A sound understanding of
applied data analysis is imperative for this purpose and future 
science and engineering graduates have an important role to 
play. Thus, the raison d'être of this book is to provide a general
introduction and a broad foundation to the mathematical,
statistical and modeling aspects of data analysis methods.



1.2 Mathematical Models 

1.2.1 Types of Data

Data² can be classified in different ways. One classification
scheme is as follows (Weiss and Hassett 1982):
• categorical data (also called nominal or qualitative) refers to
data that has non-numerical qualities or attributes, such as belonging
to one of several categories; for example, male/female,
type of engineering major, fail/pass, satisfactory/not satisfactory, ...;

² Several authors make a strict distinction between "data" which is plural
and "datum" which is singular and implies a single data point. No
such distinction is made throughout this book, and the word "data" is
used to imply either.

Fig. 1.1 The rolling of a dice is an example of discrete data where the
data can only assume whole numbers. If the dice is fair, one would expect
that out of 60 throws, numbers 1 through 6 would appear an equal
number of times. However, in reality one may get small variations about
the expected values as shown in the figure

• ordinal data, i.e., data that has some order or rank, such as 
a building envelope which is leaky, medium or tight, or a 
day which is hot, mild or cold; 

• metric data, i.e., data obtained from measurements of such 
quantities as time, weight and height. Further, there are 
two different kinds of metric data: (i) data measured on an 
interval scale which has an arbitrary zero point (such as 
the Celsius scale); and (ii) data measured on a ratio scale 
which has a zero point that cannot be arbitrarily changed 
(such as mass or volume). 

• count data, i.e., data on the number of individuals or items 
falling into certain classes or categories. 
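
The practical difference between the interval and ratio scales described above can be seen in a short sketch: differences are meaningful on both scales, but ratios are meaningful only when the zero point is not arbitrary. The temperature values are illustrative, and the Kelvin scale is used here as an example of a ratio scale for temperature.

```python
def celsius_to_kelvin(temp_c):
    """Convert from the Celsius (interval) scale to the Kelvin (ratio) scale."""
    return temp_c + 273.15


temp_a_c, temp_b_c = 10.0, 20.0
temp_a_k = celsius_to_kelvin(temp_a_c)
temp_b_k = celsius_to_kelvin(temp_b_c)

# Differences are the same on both scales.
diff_celsius = temp_b_c - temp_a_c
diff_kelvin = temp_b_k - temp_a_k

# Ratios are not: 20 deg C is not "twice as hot" as 10 deg C.
ratio_celsius = temp_b_c / temp_a_c
ratio_kelvin = temp_b_k / temp_a_k

print(f"Difference: {diff_celsius:.1f} deg C, {diff_kelvin:.1f} K")
print(f"Ratio on Celsius scale: {ratio_celsius:.3f}")
print(f"Ratio on Kelvin scale:  {ratio_kelvin:.3f}")
```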

A common type of classification relevant to metric data is 
to separate data into: 

• discrete data which can take on only a finite or countable 
number of values (most qualitative, ordinal and count data 
fall in this category). An example is the data one would 
expect by rolling a dice 60 times (Fig. 1.1); 

• continuous data which may take on any value in an interval 
(most metric data is continuous, and hence, is not coun- 
table). For example, the daily average outdoor dry-bulb 
temperature in Philadelphia, PA over a year (Fig. 1.2). 
For data analysis purposes, it is important to view data ba- 
sed on their dimensionality, i.e., the number of axes needed 
to graphically present the data. A univariate data set consists 
of observations based on a single variable, bivariate those 
based on two variables, and multivariate those based on more 
than two variables. 
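
The distinction between discrete and continuous data, and the effect of bin count noted in the caption of Fig. 1.2, can be sketched as follows. The dice throws are simulated with a fixed seed, and the "temperatures" are synthetic values drawn from a normal distribution purely for illustration; they are not the Philadelphia data, and the helper name histogram is an assumption.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so that the sketch is repeatable

# Discrete data: 60 throws of a fair dice; only the values 1 to 6 can occur.
throws = [rng.randint(1, 6) for _ in range(60)]
face_counts = Counter(throws)
for face in range(1, 7):
    print(f"Face {face}: {face_counts[face]} times (expected 10)")


def histogram(values, n_bins, low, high):
    """Count how many values fall into each of n_bins equal-width bins."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for value in values:
        # The maximum value is placed in the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Continuous data: synthetic hourly temperatures (deg F) for one year.
temperatures = [rng.gauss(55.0, 18.0) for _ in range(8760)]
low, high = min(temperatures), max(temperatures)

fine_bins = histogram(temperatures, 300, low, high)
coarse_bins = histogram(temperatures, 30, low, high)
print(f"300 bins: largest bin holds {max(fine_bins)} hours")
print(f" 30 bins: largest bin holds {max(coarse_bins)} hours")
```

With fewer, wider bins each bin holds more observations, so the random scatter from bin to bin is smaller relative to the bin height and the histogram looks smoother.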

The source or origin of the data can be one of the follo- 
wing: 

(a) Population is the collection or set of all individuals (or
items, or characteristics) representing the same quantity
with a connotation of completeness, i.e., the entire
group of items being studied whether they be the freshman
student body of a university, instrument readings
of a test quantity, or points on a curve.

Fig. 1.2 Continuous data separated into a large number of bins
(in this case, 300) resulted in the above histogram of the hourly
outdoor dry-bulb temperature (in °F) in Philadelphia, PA over
a year. A smoother distribution would have resulted if a smaller
number of bins had been selected

(b) Sample is a portion or limited number of items from a 
population from which information or readings are col- 
lected. There are again two types of samples: 

- Single-sample is a single reading or succession of
readings taken at the same time or at different
times but under identical conditions;

- Multi-sample is a repeated measurement of a fixed 
quantity using altered test conditions, such as diffe- 
rent observers or different instruments or both. 

Many experiments may appear to be multi-sample data 
but are actually single-sample data. For example, if the 
same instrument is used for data collection during diffe- 
rent times, the data should be regarded as single-sample 
not multi-sample. 

(c) Two-stage experiments are successive staged experiments
where the chance results of the first stage determine
the conditions under which the next stage will be
carried out. For example, when checking the quality of a
lot of mass-produced articles, it is frequently possible to 
decrease the average sample size by carrying out the in- 
spection in two stages. One may first take a small sample 
and accept the lot if all articles in the sample are satisfac- 
tory; otherwise a large second sample is inspected. 
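
The reduction in average sample size offered by two-stage inspection can be sketched with a short calculation. The sketch assumes that each article is defective independently with the same probability (a reasonable approximation for a large lot); the sample sizes and defect rate are hypothetical.

```python
def expected_inspections(n_first, n_second, defect_rate):
    """Average number of articles inspected under a two-stage plan.

    The lot is accepted after the first sample if it contains no defective
    article; otherwise the second sample is inspected as well.
    """
    prob_first_sample_clean = (1.0 - defect_rate) ** n_first
    prob_second_stage = 1.0 - prob_first_sample_clean
    return n_first + prob_second_stage * n_second


n_first, n_second = 10, 40  # hypothetical sample sizes
defect_rate = 0.01          # hypothetical fraction of defective articles

average_two_stage = expected_inspections(n_first, n_second, defect_rate)
single_stage = n_first + n_second

print(f"Two-stage plan: {average_two_stage:.1f} articles inspected on average")
print(f"Single-stage plan: {single_stage} articles inspected every time")
```

When the defect rate is low, most lots are accepted after the small first sample, so the average inspection effort is far below that of always inspecting the full sample.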

Finally, one needs to distinguish between: (i) a duplicate,
which is a separate specimen taken from the same source as
the first specimen, and tested at the same time and in the same
manner, and (ii) a replicate, which is the same specimen tested
again at a different time. Thus, while duplication allows one
to test samples till they are destroyed (such as tensile testing
of an iron specimen), replicate testing stops short of doing
permanent damage to the samples.

One can differentiate between different types of multi-sample
data. Consider the case of solar thermal collector
testing (as described in Pr. 5.6 of Chap. 5). In essence, the
collector is subjected to different inlet fluid temperature
levels under different values of incident solar radiation and
ambient air temperatures using an experimental facility with
instrumentation of pre-specified accuracy levels. The test
results are processed according to certain performance models
and the data plotted as collector efficiency versus
reduced temperature level. The test protocol would involve
performing replicate tests under similar reduced temperature
levels, and this is one type of multi-sample data. Another
type of multi-sample data would be the case when the same
collector is tested at different test facilities nation-wide. The
results of such a "round-robin" test are shown in Fig. 1.3,
where one detects variations around the trend line given by
the performance model which can be attributed to differences
in instrumentation and to slight differences in the
test procedures from one facility to another.



1.2.2 What is a System Model?

A system is the object under study, which could be as simple
or as complex as one may wish to consider. It is any ordered,
inter-related set of things and their attributes. A model is a
construct which allows one to represent the real-life system
so that it can be used to predict the future behavior of the
system under various "what-if" scenarios. The construct could
be a scaled down physical version of the actual system (widely
followed historically in engineering) or a mental construct,
which is what is addressed in this book. The development
of a model is not the ultimate objective; in other words,
it is not an end by itself. It is a means to an end, the end being
a credible means to make decisions which could involve
system-specific issues (such as gaining insights about influential
drivers and system dynamics, or predicting system behavior,
or determining optimal control conditions) as well as those
involving a broader context (such as operation management,
deciding on policy measures and planning).

Fig. 1.3 Example of multi-sample data in the framework of a
"round-robin" experiment of testing the same solar thermal collector
in six different test facilities (shown by different symbols) following
the same testing methodology. The test data is used to determine and
plot the collector efficiency versus the reduced temperature along with
uncertainty bands (see Pr. 5.6 for nomenclature). (Streed et al. 1979)



1.2.3 Types of Models

One differentiates between different types of models: 

(i) intuitive models (or qualitative or descriptive models)
are those where the system's behavior is summarized in
non-analytical forms because only general qualitative
trends of the system are known. Such a model, which
relies on qualitative or ordinal data, is an aid to thought
or to communication. Sociological or anthropological
behaviors are typical examples;

(ii) empirical models, which use metric or count data, are
those where the properties of the system can be summarized
in a graph, a table or a curve fit to observation
points. Such models presume knowledge of the fundamental
quantitative trends but lack accurate understanding.
Econometric models are typical examples; and

(iii) mechanistic models (or structural models), which use
metric or count data, are based on mathematical relationships
used to describe physical laws such as Newton's
laws, the laws of thermodynamics, etc. Such models
can be used for prediction (system design) or for proper
system operation and control (data analysis). Further,
such models can be separated into two sub-groups:

- exact structural models where the model equation
is thought to apply rigorously, i.e., the relationship
between the variables and parameters in the model
is exact, or as close to exact as the current state of
scientific understanding permits, and

- inexact structural models where the model equation
applies only approximately, either because the process
is not fully known or because one chose to simplify
the exact model so as to make it more usable.
A typical example is the dose-response model which
characterizes the relation between the amount of toxic
agent imbibed by an individual and the incidence
of adverse health effect.

Further, one can envision two different types of systems:
open systems in which energy and/or matter flows into
and out of the system, and closed systems in which neither
energy nor matter is exchanged with the environment.

A system model is a description of the system. Empirical
and mechanistic models are made up of three components:
(i) input variables (also referred to as regressor, forcing,
exciting, exogenous or independent variables in the
engineering, statistical and econometric literature) which
act on the system. Note that there are two types of such
variables: those controllable by the experimenter, and
uncontrollable or extraneous variables, such as climatic
variables;
(ii) system structure and parameters/properties which provide
the necessary physical description of the systems
in terms of physical and material constants; for example,
thermal mass, overall heat transfer coefficients,
mechanical properties of the elements; and
(iii) output variables (also called response, state, endogenous
or dependent variables) which describe system
response to the input variables.
A structural model of a system is a mathematical relationship
between one or several input variables and parameters
and one or several output variables. Its primary purpose
is to allow better physical understanding of the phenomenon
or process or, alternatively, to allow accurate prediction
of system reaction. This is useful for several purposes, for
example, preventing adverse phenomena from occurring,
for proper system design (or optimization) or to improve
system performance by evaluating other modifications to the
system. A satisfactory mathematical model is subject to two
contradictory requirements (Edwards and Penney 1996): it
must be sufficiently detailed to represent the phenomenon it
is attempting to explain or capture, yet it must be sufficiently
simple to make the mathematical analysis practical. This
requires judgment and experience of the modeler backed by
experimentation and validation^.

Examples of Simple Models:

(a) Pressure drop $\Delta p$ of a fluid flowing at velocity $v$ through
a pipe of hydraulic diameter $D_h$ and length $L$:

    $\Delta p = f \frac{L}{D_h} \frac{\rho v^2}{2}$    (1.1)

where $f$ is the friction factor, and $\rho$ is the density of the
fluid. For a given system, $v$ can be viewed as the independent
or input variable, while the pressure drop is the
state variable. The factors $f$, $L$ and $D_h$ are the system or
model parameters and $\rho$ is a property of the fluid. Note
that the friction factor $f$ is itself a function of the velocity,
thus making the problem a bit more complex.

(b) Rate of heat transfer from a fluid to a surrounding solid:

    $\dot{Q} = UA (T_f - T_s)$    (1.2)

where the parameter $UA$ is the overall heat conductance,
and $T_f$ and $T_s$ are the mean fluid and solid temperatures
(which are the input variables).

(c) Rate of heat added to a flowing fluid:

    $\dot{Q} = \dot{m} c_p (T_{out} - T_{in})$    (1.3)

where $\dot{m}$ is the fluid mass flow rate, $c_p$ is its specific
heat at constant pressure, and $T_{out}$ and $T_{in}$ are the exit and
inlet fluid temperatures. It is left to the reader to identify
the input variables, state variables and the model parameters.

(d) Lumped model of the water temperature $T_s$ in a storage
tank with an immersed heating element and losing heat
to the environment is given by the first order ordinary
differential equation (ODE):

    $M c_p \frac{dT_s}{dt} = P - UA (T_s - T_i)$    (1.4)

where $M c_p$ is the thermal heat capacitance of the tank
(water plus tank material), $T_i$ the environment temperature,
and $P$ is the auxiliary power (or heat rate) supplied to the
tank. It is left to the reader to identify the input variables,
state variables and the model parameters.
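The lumped tank model of Eq. 1.4 can be explored numerically. The following is a minimal sketch, not a definitive implementation: it assumes that P and T_i are constant in time, uses a simple forward Euler scheme, and compares the result with the closed-form solution that holds for constant inputs. The function names and the numerical values are illustrative assumptions.

```python
import math


def tank_temperature_euler(T0, P, UA, Mcp, Ti, dt, n_steps):
    """Integrate Mcp * dTs/dt = P - UA * (Ts - Ti) with forward Euler.

    Returns the list of tank temperatures at t = 0, dt, 2*dt, ...
    """
    Ts = T0
    history = [Ts]
    for _ in range(n_steps):
        dTs_dt = (P - UA * (Ts - Ti)) / Mcp
        Ts += dt * dTs_dt
        history.append(Ts)
    return history


def tank_temperature_exact(T0, P, UA, Mcp, Ti, t):
    """Closed-form solution, valid only when P and Ti are constant."""
    T_final = Ti + P / UA  # steady-state tank temperature
    return T_final + (T0 - T_final) * math.exp(-UA * t / Mcp)


if __name__ == "__main__":
    # Illustrative values: Mcp in J/K, UA in W/K, P in W, temperatures in deg C
    args = dict(T0=20.0, P=500.0, UA=10.0, Mcp=4.0e5, Ti=20.0)
    numeric = tank_temperature_euler(dt=60.0, n_steps=600, **args)
    print(numeric[-1], tank_temperature_exact(t=600 * 60.0, **args))
```

With these assumed values the ratio Mcp/UA (here 40,000 s) sets how quickly the tank approaches its steady-state temperature T_i + P/UA.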



^ Validation is defined as the process of bringing the user's confidence
about the model to an acceptable level either by comparing its
performance to other more accepted models or by experimentation.



Table 1.1 Ways of classifying mathematical models

     Different classification methods
 1   Distributed vs lumped parameter
 2   Dynamic vs static or steady-state
 3   Deterministic vs stochastic
 4   Continuous vs discrete
 5   Linear vs non-linear in the functional model
 6   Linear vs non-linear in the model parameters
 7   Time invariant vs time variant
 8   Homogeneous vs non-homogeneous
 9   Simulation vs performance models
10   Physics based (white box) vs data based (black box) and mix
     of both (grey box)



1.2.4 Classification of Mathematical Models

Predicting the behavior of a system requires a mathematical
representation of the system components. The process of
deciding on the level of detail appropriate for the problem at
hand is called abstraction (Cha et al. 2000). This process has
to be undertaken with care: (i) over-simplification may result
in loss of important system behavior predictability, while (ii)
an overly-detailed model may require undue data and
computational resources as well as time spent in understanding
the model assumptions and results generated. There are
different ways by which mathematical models can be classified.
Some of these are shown in Table 1.1 and described below
(adapted from Eisen 1988).

(i) Distributed vs Lumped Parameter In a distributed parameter
system, the elements of the system are continuously
distributed along the system geometry so that the variables
they influence must be treated as differing not only in time
but also in space, i.e., from point to point. Partial differential
or difference equations are usually needed. Recall that a
partial differential equation (PDE) is a differential equation
involving partial derivatives of an unknown function with
respect to at least two independent variables. One distinguishes
between two general cases:

• the independent variables are space variables only

• the independent variables are both space and time variables.

Though partial derivatives of multivariable functions are
ordinary derivatives with respect to one variable (the others
being kept constant), the study of PDEs is not an easy extension
of the theory for ordinary differential equations (ODEs).
The solution of PDEs requires fundamentally different
approaches. Recall that ODEs are solved by first finding general
solutions and then using subsidiary conditions to determine
arbitrary constants. However, such arbitrary constants in
general solutions of ODEs are replaced by arbitrary functions
in PDEs, and determination of these arbitrary functions
using subsidiary conditions is usually impossible. In other
words, general solutions of ODEs are of limited use in solving
PDEs. In general, the solution of the PDEs and subsidiary
conditions (called initial or boundary conditions) needs
to be determined simultaneously. Hence, it is wise to try to
simplify the PDE model as far as possible when dealing with
data analysis problems.

Fig. 1.4 Cooling of a solid sphere in air can be modeled as a lumped
model provided the Biot number Bi<0.1. This number is proportional
to the ratio of the heat conductive resistance (1/k) inside the sphere to
the convective resistance (1/h) from the outer envelope of the sphere
to the air

In a lumped parameter system, the elements are small
enough (or the objective of the analysis is such that simplification
is warranted) so that each such element can be treated as
if it were concentrated (i.e., lumped) at one particular spatial
point in the system. The position of the point can change with
time but not in space. Such systems usually are adequately
modeled by ODE or difference equations. A heated billet as
it cools in air could be analyzed as either a distributed system
or a lumped parameter system depending on whether the Biot
number (Bi) is greater than or less than 0.1 (see Fig. 1.4). The
Biot number is proportional to the ratio of the internal to the
external heat flow resistances of the body, and a small Biot
number would imply that the resistance to heat flow attributed
to internal body temperature gradient is small enough that
it can be neglected without biasing the analysis. Thus, a small
body with high thermal conductivity and low convection
coefficient can be adequately modeled as a lumped system.
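The Biot number criterion can be checked with a few lines of code. The sketch below assumes the common convention Bi = h L_c / k with the characteristic length L_c taken as volume divided by surface area (r/3 for a sphere); this convention, the function names and the numerical values are assumptions introduced for illustration and are not stated in the text.

```python
def biot_number_sphere(h, radius, k):
    """Biot number Bi = h * Lc / k for a sphere.

    Assumes the characteristic length Lc = volume / surface area = radius / 3.
    h: convection coefficient (W/m2-K), radius (m), k: conductivity (W/m-K)
    """
    characteristic_length = radius / 3.0
    return h * characteristic_length / k


def can_use_lumped_model(biot, threshold=0.1):
    """True when internal resistance is small enough to neglect (Bi < 0.1)."""
    return biot < threshold


if __name__ == "__main__":
    # Small metal sphere in still air (illustrative values)
    bi_small = biot_number_sphere(h=10.0, radius=0.03, k=50.0)
    # Large poorly conducting sphere with strong convection (illustrative values)
    bi_large = biot_number_sphere(h=500.0, radius=0.3, k=1.0)
    print(bi_small, can_use_lumped_model(bi_small))
    print(bi_large, can_use_lumped_model(bi_large))
```

The first case falls well below the 0.1 threshold and the second far above it, which mirrors the statement that small bodies with high conductivity and low convection coefficient suit a lumped treatment.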

Another example of lumped model representation is the
1-D heat flow through the wall of a building (Fig. 1.5a) using
the analogy between heat flow and electricity flow. The
internal and external convective film heat transfer coefficients
are represented by $h_i$ and $h_o$ respectively, while $k$, $\rho$ and $c$
are the thermal conductivity, density and specific heat of the
wall material respectively. In the lower limit, the wall can be
discretized into one lumped layer of capacitance C with two
resistors as shown by the electric network of Fig. 1.5b (referred
to as 2R1C network). In the upper limit, the network
can be represented by "n" nodes (see Fig. 1.5c). The 2R1C
simplification does lead to some errors, which under certain
circumstances are outweighed by the convenience it provides
while yielding acceptable results.
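One common way of writing the 2R1C network as a single ODE is sketched below. It rests on an assumption not stated in the text, namely that the wall capacitance is lumped at the mid-plane so that the conductive resistance of the wall is split equally between the two resistors. The symbols $T_w$ (lumped wall temperature), $T_i$ and $T_o$ (indoor and outdoor air temperatures), $R_i$ and $R_o$ are introduced here for illustration only.

```latex
C \frac{dT_w}{dt} = \frac{T_o - T_w}{R_o} + \frac{T_i - T_w}{R_i},
\qquad
C = \rho c A \Delta x,
\qquad
R_o = \frac{1}{h_o A} + \frac{\Delta x}{2 k A},
\qquad
R_i = \frac{1}{h_i A} + \frac{\Delta x}{2 k A}
```

At steady state the capacitance term vanishes and the heat flow reduces to $(T_o - T_i)/(R_o + R_i)$, i.e., the familiar series resistance of the two films and the wall.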

Fig. 1.5 Thermal networks to model heat flow through a homogeneous
plane wall of surface area A and wall thickness Δx. a Schematic of
the wall with the indoor and outdoor temperatures and convective
heat flow coefficients. b Lumped model with two resistances and one
capacitance (2R1C model). c Higher nth order model with n layers of
equal thickness (Δx/n). While all capacitances are assumed equal,
only the (n-2) internal resistances (excluding the two end
resistances) are equal

(ii) Dynamic vs Steady-State Dynamic models are defined
as those which allow transient system or equipment behavior
to be captured with explicit recognition of the time varying
behavior of both output and input variables. The steady-state
or static or zeroeth model is one which assumes no time
variation in its input variables (and hence, no change in the
output variable as well). One can also distinguish an
intermediate type, referred to as quasi-static models. Cases arise
when the input variables (such as incident solar radiation on
a solar hot water panel) are constantly changing at a short
time scale (say, at the minute scale) while it is adequate to
predict thermal output at, say, hourly intervals. The dynamic
behavior is poorly predicted by the solar collector model at
such high frequency time scales, and so the input variables
can be "time-averaged" so as to make them constant during
a specific hourly interval. This is akin to introducing a "low
pass filter" for the inputs. Thus, the use of quasi-static models
allows one to predict the system output(s) in discrete
time variant steps or intervals during a given day with the
system inputs averaged (or summed) over each of the time
intervals fed into the model. These models could be either
zeroeth order or low order ODE.
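The time-averaging of inputs used in quasi-static models amounts to a block average. A minimal sketch is given below, assuming equally spaced samples (for example, 60 one-minute readings per hour); the function name and the sample data are illustrative assumptions.

```python
def block_average(samples, block_size):
    """Average consecutive, non-overlapping blocks of `block_size` samples.

    An incomplete trailing block is dropped rather than averaged.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    n_blocks = len(samples) // block_size
    return [
        sum(samples[i * block_size:(i + 1) * block_size]) / block_size
        for i in range(n_blocks)
    ]


if __name__ == "__main__":
    # Two hours of made-up minute-by-minute readings reduced to hourly means
    minute_values = [float(v) for v in range(120)]
    print(block_average(minute_values, 60))
```

Summing instead of averaging (for energy quantities) only requires dropping the division by block_size.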

Dynamic models are usually represented by PDEs or,
when spatially lumped, by ODEs with respect to time. One
could solve them directly, and the simple cases are illustrated
in Sect. 1.2.5. Since solving these equations gets harder as
the order of the model increases, it is often more convenient
to recast the differential equations in a time-series formulation
using response functions or transfer functions, which involve
time-lagged values of the input variable(s) only, or of both
the inputs and the response, respectively. This formulation is
discussed in Chap. 9. The steady-state or static or zeroeth
model is one which assumes no time variation in its inputs
or outputs. Its time series formulation results in simple
algebraic equations with no time-lagged values of the input
variable(s) appearing in the function.

(iii) Deterministic vs Stochastic A deterministic system
is one whose response to specified inputs under specified
conditions is completely predictable (to within a certain
accuracy of course) from physical laws. Thus, the response is
precisely reproducible time and again. A stochastic system is
one where the specific output can be predicted to within an
uncertainty range only, which could be due to two reasons:
(i) that the inputs themselves are random and vary unpredictably
within a specified range of values (such as the electric
power output of a wind turbine subject to gusting winds),
and/or (ii) because the models are not accurate (for example,
the dose-response of individuals when subject to asbestos
inhalation). Concepts from probability theory are required to
make predictions about the response.

The majority of observed data have some stochasticity in
them either due to measurement/miscellaneous errors or due
to the nature of the process itself. If the random element is so
small that it is negligible as compared to the "noise" in the
system, then the process or system can be treated in a purely
deterministic framework. The orbits of the planets, though
well described by Kepler's laws, have some small disturbances
due to other secondary effects, but Newton was able to treat
them as deterministic. On the other hand, Brownian motion is
purely random, and has to be treated by stochastic methods.

(iv) Continuous vs Discrete A continuous system is one
in which all the essential variables are continuous in nature
and the time that the system operates is some interval (or
intervals) of the real numbers. Usually such systems need
differential equations to describe them. A discrete system is
one in which all essential variables are discrete and the time
that the system operates is a finite subset of the real numbers.
This system can be described by difference equations.

In most applications in engineering, the system or process
being studied is fundamentally continuous. However, the
continuous output signal from a system is usually converted
into a discrete signal by sampling. Alternatively, the continuous
system can be replaced by its discrete analog which, of
course, has a discrete signal. Hence, analysis of discrete data
is usually more relevant in data analysis applications.

(v) Linear vs Non-linear A system is said to be linear if and
only if it has the following property: if an input $x_1(t)$ produces
an output $y_1(t)$, and if an input $x_2(t)$ produces an output $y_2(t)$,
then an input $[c_1 x_1(t) + c_2 x_2(t)]$ produces an output
$[c_1 y_1(t) + c_2 y_2(t)]$ for all pairs of inputs $x_1(t)$ and $x_2(t)$ and
all pairs of real number constants $c_1$ and $c_2$. This concept is
illustrated in Fig. 1.6. An equivalent concept is the principle of
superposition, which states that the response of a linear system due to
several inputs acting simultaneously is equal to the sum of
the responses of each input acting alone. This is an extremely
important concept since it allows the response of a complex
system to be determined more simply by decomposing the input
driving function into simpler terms, solving the equation
for each term separately, and then summing the individual
responses to obtain the desired aggregated response.
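The superposition property can be tested numerically on sampled signals. The sketch below is illustrative only: the two example systems (a discrete first-order filter and a squaring element) and all names are assumptions chosen for the demonstration. Passing the check for one pair of inputs does not prove linearity, whereas failing it does demonstrate non-linearity.

```python
def first_order_filter(x, alpha=0.5):
    """Linear discrete system: y[k] = alpha*y[k-1] + (1-alpha)*x[k], y[-1] = 0."""
    y, prev = [], 0.0
    for xk in x:
        prev = alpha * prev + (1.0 - alpha) * xk
        y.append(prev)
    return y


def squarer(x):
    """Non-linear system: each output sample is the square of the input."""
    return [xk * xk for xk in x]


def obeys_superposition(system, x1, x2, c1, c2, tol=1e-9):
    """Compare system(c1*x1 + c2*x2) with c1*system(x1) + c2*system(x2)."""
    combined_input = [c1 * a + c2 * b for a, b in zip(x1, x2)]
    response_to_sum = system(combined_input)
    sum_of_responses = [c1 * a + c2 * b for a, b in zip(system(x1), system(x2))]
    return all(abs(u - v) <= tol for u, v in zip(response_to_sum, sum_of_responses))


if __name__ == "__main__":
    x1, x2 = [1.0, 0.0, 2.0, 3.0], [0.0, 1.0, -1.0, 4.0]
    print(obeys_superposition(first_order_filter, x1, x2, 2.0, -3.0))
    print(obeys_superposition(squarer, x1, x2, 2.0, -3.0))
```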
Fig. 1.6 Principle of superposition of a linear system

An important distinction needs to be made between a linear
model and a model which is linear in its parameters. For
example,

• $y = a x_1 + b x_2$ is linear in both model and parameters a and b,

• $y = a \sin x_1 + b x_2$ is a non-linear model but is linear in its
parameters, and

• $y = a \exp(b x_1)$ is non-linear in both model and parameters.
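The practical value of a model being linear in its parameters is that the parameters can be estimated by ordinary linear least squares even though the model itself is non-linear in the variables. The sketch below illustrates this for the second model form above by solving the 2x2 normal equations directly; the function name and the synthetic data are assumptions made for illustration.

```python
import math


def fit_linear_in_parameters(x1, x2, y):
    """Least-squares estimates of a and b in y = a*sin(x1) + b*x2."""
    # Once sin(x1) is formed, the model is linear in the unknowns a and b
    u = [math.sin(v) for v in x1]
    w = list(x2)
    suu = sum(ui * ui for ui in u)
    sww = sum(wi * wi for wi in w)
    suw = sum(ui * wi for ui, wi in zip(u, w))
    suy = sum(ui * yi for ui, yi in zip(u, y))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    det = suu * sww - suw * suw
    if abs(det) < 1e-12:
        raise ValueError("regressors are collinear; parameters not identifiable")
    # Cramer's rule applied to the normal equations
    a = (suy * sww - suw * swy) / det
    b = (suu * swy - suw * suy) / det
    return a, b


if __name__ == "__main__":
    x1 = [0.0, 0.5, 1.0, 1.5, 2.0]
    x2 = [1.0, 2.0, 0.5, 3.0, 1.5]
    y = [2.0 * math.sin(p) - 0.5 * q for p, q in zip(x1, x2)]  # noise-free data
    print(fit_linear_in_parameters(x1, x2, y))
```

No such closed-form estimate exists for the third model form, where b appears inside the exponential; it requires either a transformation or an iterative search.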

In all fields, linear differential or difference equations are
by far more widely used than non-linear equations. Even if
the models are non-linear, every attempt is made, due to the
subsequent convenience it provides, to make them linear
either by suitable transformation (such as logarithmic transform)
or by piece-wise linearization, i.e., linear approximation
over a smaller range of variation. The advantages of linear
systems over non-linear systems are many:

• linear systems are simpler to analyze,

• general theories are available to analyze them,

• they do not have singular solutions (simpler engineering
problems rarely have them anyway),

• well-established methods are available, such as the state
space approach, for analyzing even relatively complex
sets of equations. The practical advantage with this type
of time domain transformation is that large systems of
higher-order ODEs can be transformed into a first order
system of simultaneous equations which, in turn, can be
solved rather easily by numerical methods using standard
computer programs.
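The state space idea mentioned in the last item can be sketched as follows. A second-order ODE m x'' + c x' + k x = 0 is rewritten as two first-order equations in the state (x, v) with v = x', and the pair is integrated with a classical fourth-order Runge-Kutta step. The example equation, the function names and the choice of integrator are assumptions made for illustration, not a method prescribed by the text.

```python
import math


def second_order_as_first_order(m, c, k):
    """Return f(state) for m*x'' + c*x' + k*x = 0 with state = (x, v), v = x'."""
    def derivative(state):
        x, v = state
        return (v, -(c * v + k * x) / m)
    return derivative


def rk4_step(f, state, dt):
    """Advance the first-order system by one classical Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * d for s, d in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * d for s, d in zip(state, k2)))
    k4 = f(tuple(s + dt * d for s, d in zip(state, k3)))
    return tuple(
        s + dt * (d1 + 2.0 * d2 + 2.0 * d3 + d4) / 6.0
        for s, d1, d2, d3, d4 in zip(state, k1, k2, k3, k4)
    )


def simulate(f, state0, dt, n_steps):
    """Return the state after n_steps steps of size dt."""
    state = state0
    for _ in range(n_steps):
        state = rk4_step(f, state, dt)
    return state


if __name__ == "__main__":
    # Undamped oscillator x'' + x = 0, x(0) = 1, x'(0) = 0, exact x(t) = cos(t)
    f = second_order_as_first_order(m=1.0, c=0.0, k=1.0)
    x, v = simulate(f, (1.0, 0.0), dt=0.01, n_steps=100)
    print(x, math.cos(1.0))
```

The same pattern extends to an nth order ODE by carrying n state variables, which is why standard first-order solvers suffice for large systems.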

(vi) Time Invariant vs Time Variant A system is time-invariant
or stationary if neither the form of the equations
characterizing the system nor the model parameters vary
with time under either different or constant inputs; otherwise
the system is time-variant or non-stationary. In some cases,
when the model structure is poor and/or when the data are
very noisy, time variant models are used, requiring either
on-line or off-line updating depending on the frequency of
the input forcing functions and how quickly the system
responds. Examples of such instances abound in electrical
engineering applications. Usually, one tends to encounter time
invariant models in less complex thermal and environmental
engineering applications.

(vii) Homogeneous vs Non-homogeneous If there are no
external inputs and the system behavior is determined entirely
by its initial conditions, then the system is called
homogeneous or unforced or autonomous; otherwise it is called
non-homogeneous or forced. Consider the general form of an
nth order time-invariant or stationary linear ODE:

    $A y^{(n)} + B y^{(n-1)} + \dots + M y'' + N y' + O y = P(x)$    (1.5)

where $y'$, $y''$ and $y^{(n)}$ are the first, second and nth derivatives of
y with respect to x, and A, B, ..., M, N and O are constants.
The function P(x) frequently corresponds to some external
influence on the system, and is a function of the independent
variable. Often, the independent variable is the time variable
t. This is intentional since time comes into play when the
dynamic behavior of most physical systems is modeled. However,
the variable t can be assigned any other physical quantity
as appropriate.

To completely specify the problem, i.e., to obtain a unique
solution y(x), one needs to specify two additional factors: (i)
the interval of x over which a solution is desired, and (ii) a set
of n initial conditions. If these conditions are such that y(x)
and its first (n-1) derivatives are specified for x=0, then the
problem is called an initial value problem. Thus, one
distinguishes between:

(a) the homogeneous form where P(x) = 0, i.e., there is no
external driving force. The solution of the differential
equation:

    $A y^{(n)} + B y^{(n-1)} + \dots + M y'' + N y' + O y = 0$    (1.6)

yields the free response of the system. The homogeneous
solution is a general solution whose arbitrary constants
are then evaluated using the initial (or boundary) conditions,
thus making it unique to the situation.

(b) the non-homogeneous form where $P(x) \neq 0$ and
Eq. 1.5 applies. The forced response of the system is
associated with the case when all the initial conditions
are identically zero, i.e., $y(0), y'(0), \dots, y^{(n-1)}(0)$ are all
zero. Thus, the implication is that the forced response
is only dependent on the external forcing function P(x).
The total response of the linear time-invariant ODE is
the sum of the free response and the forced response
(thanks to the superposition principle). When system
control is being studied, slightly different terms are often
used to specify total dynamic system response: (a)
the steady-state response is that part of the total response
which does not approach zero as time approaches
infinity, and (b) the transient response is that part of the
total response which approaches zero as time approaches
infinity.
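These definitions can be made concrete with a simple worked case. The example below is an illustration chosen here, not taken from the text: a first-order linear ODE with constant coefficients $\tau$ and $K$, a constant forcing $u$ applied from $t = 0$, and an initial value $y(0) = y_0$.

```latex
\tau \frac{dy}{dt} + y = K u, \qquad y(0) = y_0
\\[4pt]
\text{free response } (u = 0): \quad y_{free}(t) = y_0 \, e^{-t/\tau}
\\[4pt]
\text{forced response } (y_0 = 0): \quad y_{forced}(t) = K u \left(1 - e^{-t/\tau}\right)
\\[4pt]
\text{total response:} \quad
y(t) = y_{free}(t) + y_{forced}(t)
     = \underbrace{K u}_{\text{steady-state}}
     + \underbrace{(y_0 - K u)\, e^{-t/\tau}}_{\text{transient}}
```

The two groupings describe the same solution: the free/forced split separates the effect of the initial condition from that of the forcing, while the steady-state/transient split separates what persists from what decays as $t \to \infty$.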

(viii) Simulation Versus Performance Based The distinguishing
trait between simulation and performance models
is the basis on which the model structure is framed (this
categorization is quite important). Simulation models are used
to predict system performance during the design phase when
no actual system exists and alternatives are being evaluated.
A performance based model relies on measured performance
data of the actual system to provide insights into model
structure and to estimate its parameters. A widely accepted
classification involves the following:
Table 1.2 Description of different types of models

Model type         Time variation of       Model complexity       Physical        Type of
                   system inputs/outputs                          understanding   equation

Simulation model   Dynamic,                White box,             High            PDEs,
                   quasi-static            detailed mechanistic                   ODEs

Performance model  Quasi-static,           Gray box,              Medium          ODEs,
                   steady-state            semi-empirical,                        algebraic
                                           lumped

Performance model  Static or               Black box,             Low             Algebraic
                   steady-state            empirical

ODE ordinary differential equations, PDE partial differential equations



(a) White-box models (also called detailed mechanistic models,
reference models or small-time step models) are
based on the laws of physics and permit accurate and
microscopic modeling of the various fluid flow, heat
and mass transfer phenomena which occur within the
equipment or system. These are used for simulation
purposes. Usually, temporal and spatial variations are
considered, and these models are expressed by PDEs
or ODEs. As shown in Table 1.2, a high level of physical
understanding is necessary to develop these models,
complemented with some expertise in numerical analysis
in order to solve these equations. Consequently,
these have found their niche in simulation studies which
require dynamic and transient operating conditions to
be accurately captured.

(b) Black-box models (or empirical or curve-fit or data-driven
models) are based on little or no knowledge of the physical
behavior of the system and rely on the available data to identify
the model structure. These belong to one type of performance
models which are suitable for predicting future
behavior under a similar set of operating conditions
to those used in developing the model. However, they
provide little or no insight into the process or
phenomenon dictating system behavior.
Statistical methods play a big role in dealing with
uncertainties during model identification and model
prediction. Historically, these types of models were the
first ones developed for engineering systems based on
concepts from numerical methods. They are still used
when the system is too complex to be modeled physically,
or when a "quick-and-dirty" analysis is needed.
They are used in both simulation studies (where they
are often used to model specific sub-systems or individual
equipment of a larger system) and as performance
models.

(c) Gray-box models fall in-between the two above categories
and are best suited for performance models. A small
number of possible model structures, loosely based on
the physics of the underlying phenomena and simplified
in terms of time and/or space, are posited, and then the
available data is used to identify the best model and to
determine the model parameters. The resulting models
are usually lumped models based on first-order ODE or
algebraic equations. They are primarily meant to gain
better physical understanding of the system behavior
and its interacting parts; they can also provide adequate
prediction accuracy. The identification of these models,
which combine phenomenological plausibility with
mathematical simplicity, generally requires both good
understanding of the physical phenomenon or of the
systems/equipment being modeled, and a competence
in statistical methods. These models are a major focus
of this book, and they appear in several chapters.

Several authors, for example Sprent (1998), also use terms
such as (i) data driven models to imply those which are
suggested by the data at hand and commensurate with
knowledge about system behavior; this is somewhat akin to our
definition of black-box models, and (ii) model driven
approaches as those which assume a pre-specified model and the
data is used to determine the model parameters; this is
synonymous with gray-box models as defined here. However,
this book makes no such distinction and uses the term "data
driven models" interchangeably with performance models so
as not to confuse the reader.



1.2.5 Models for Sensor Response 

Let us illustrate steady-state and dynamic system responses
using the example of measurement sensors. As stated above,
one can categorize models into dynamic or static based on
the time-variation of the system inputs and outputs.

Steady-state models (also called zeroeth order models)
are the simplest models one can use. As stated earlier, they
apply when input variables (and hence, the output variables)
are maintained constant. A zeroeth order model for the
dynamic performance of measuring systems is used (i) when
the variation in the quantity to be measured is very slow as
compared to how quickly the instrument responds, or (ii) as a
standard of comparison for other more sophisticated models.
For a zero-order instrument, the output is directly proportional
to the input, such that (Doebelin 1995):



    $a_0 q_o = b_0 q_i$    (1.7a)

or

    $q_o = K q_i$    (1.7b)

where $a_0$ and $b_0$ are the system parameters, assumed time
invariant, $q_o$ and $q_i$ are the output and the input quantities
respectively, and $K = b_0/a_0$ is called the static sensitivity of the
instrument.

Hence, only K is required to completely specify the response
of the instrument. Thus, the zeroeth order instrument
is an ideal instrument; no matter how rapidly the measured
variable changes, the output signal faithfully and instantaneously
reproduces the input.

Fig. 1.7 Step-responses of two first-order instruments with different
response times with assumed numerical values of time (x-axis)
and instrument reading (y-axis). The response is characterized by
the time constant which is the time for the instrument reading to
reach 63.2% of the steady-state value
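To contrast this ideal zero-order behavior with an instrument that lags its input, the following minimal sketch evaluates the step response of a first-order instrument, the model with a single time constant treated in this section. It assumes zero output before the step, so that the response is K q_is (1 - exp(-t/tau)); the function names and the numerical values are illustrative assumptions.

```python
import math


def first_order_step_response(t, K, q_step, tau):
    """Instrument output at time t after the input steps from 0 to q_step.

    Assumes the output is zero before the step: K*q_step*(1 - exp(-t/tau)).
    """
    return K * q_step * (1.0 - math.exp(-t / tau))


def settling_time(tau, band=0.05):
    """Time for the output to come within the fraction `band` of its final value."""
    return -tau * math.log(band)


if __name__ == "__main__":
    K, q_step, tau = 2.0, 10.0, 4.0  # illustrative values
    final_value = K * q_step
    print(first_order_step_response(tau, K, q_step, tau) / final_value)
    print(settling_time(tau) / tau)
```

The first printed ratio is 1 - exp(-1), i.e., the fraction of the final change reached after one time constant, and the second is -ln(0.05), which is why the 5% settling time is commonly quoted as about three time constants.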

The next step in complexity used to represent measuring 
system response is the first-order model: 



$$a_1 \frac{dq_o}{dt} + a_0 q_o = b_0 q_i \qquad (1.8a)$$

or

$$\tau \frac{dq_o}{dt} + q_o = K q_i \qquad (1.8b)$$

where $\tau = a_1/a_0$ is the time constant of the instrument, and
$K$ is the static sensitivity of the instrument, which is identical
to the value defined for the zeroeth order model. Thus, two
numerical parameters are used to completely specify a first-order
instrument.

The solution to Eq. 1.8b for a step change in input is:

$$q_o(t) = K q_{is} \left(1 - e^{-t/\tau}\right) \qquad (1.9)$$

where $q_{is}$ is the value of the input quantity after the step
change.

After a step change in the input, the steady-state value of
the output will be $K$ times the input $q_{is}$ (just as in the
zero-order instrument). This is shown as a dotted horizontal line
in Fig. 1.7 with a numerical value of 20. The time constant
characterizes the speed of response; the smaller its value the
faster the response, and vice versa, to any kind of input.
Figure 1.7 illustrates the dynamic response and the associated
time constants for two instruments when subject to a step
change in the input. Numerically, the time constant represents
the time taken for the response to reach 63.2% of its
final change, or to reach a value within 36.8% of the final
value. This is easily seen from Eq. 1.9 by setting $t = \tau$, in
which case $q_o/(K q_{is}) = (1 - e^{-1}) = 0.632$. Another useful
measure of response speed for any instrument is the 5% settling
time, i.e., the time for the output signal to get to within 5% of
the final value. For any first-order instrument, it is equal to 3
times the time constant.

1.2.6 Block Diagrams

Information flow or block diagram⁴ is a standard shorthand
manner of schematically representing the input and output
quantities of an element or a system as well as the computational
sequence of variables. It is a concept widely used
during system simulation since a block implies that its output
can be calculated provided the inputs are known. Block diagrams
are very useful for setting up the set of model equations to solve
in order to simulate or analyze systems or components.
As illustrated in Fig. 1.8, a centrifugal pump could be represented
as one of many possible block diagrams (as shown in
Fig. 1.9) depending on which parameters are of interest. If the
model equation is cast in a form such that the outlet pressure
$p_2$ is the response variable and the inlet pressure $p_1$ and the
fluid volumetric flow rate $v$ are the forcing variables, then the
associated block diagram is that shown in Fig. 1.9a. Another
type of block diagram is shown in Fig. 1.9b where flow rate
$v$ is the response variable. The arrows indicate the direction
of unilateral information or signal flow. Thus, such diagrams
depict the manner in which the simulation models of the various
components of a system need to be formulated.

⁴ Block diagrams should not be confused with material flow diagrams
which for a given system configuration are unique. On the other hand,
there can be numerous ways of assembling block diagrams depending
on how the problem is framed.

Fig. 1.8 Schematic of a centrifugal pump rotating at speed s (say, in
rpm) which pumps a water flow rate v from lower pressure $p_1$ to higher
pressure $p_2$
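The idea that one component model can be cast as different blocks
can be sketched in a few lines of code. The quadratic pump curve and
its coefficients below are hypothetical, chosen only for illustration
and not taken from the text; the two functions stand in for the
formulations of Fig. 1.9a (outlet pressure as response) and Fig. 1.9b
(flow rate as response).

```python
import math

# Hypothetical pump curve (illustrative coefficients only):
# pressure rise p2 - p1 = A0 + A1*v + A2*v**2, in kPa with v in L/s.
A0, A1, A2 = 200.0, -1.0, -2.0


def outlet_pressure(p1, v):
    """Block in the style of Fig. 1.9a: inputs (p1, v) -> output p2."""
    return p1 + A0 + A1 * v + A2 * v ** 2


def flow_rate(p1, p2):
    """Block in the style of Fig. 1.9b: inputs (p1, p2) -> output v.

    Same pump curve, solved for v (the non-negative root of the quadratic).
    """
    rise = p2 - p1
    disc = A1 ** 2 - 4.0 * A2 * (A0 - rise)
    if disc < 0.0:
        raise ValueError("pressure rise is not reachable on this pump curve")
    v = (-A1 - math.sqrt(disc)) / (2.0 * A2)
    if v < 0.0:
        raise ValueError("pressure rise is not reachable at a non-negative flow")
    return v


p2 = outlet_pressure(100.0, 5.0)   # forcing variables: p1 and v
v = flow_rate(100.0, p2)           # forcing variables: p1 and p2
print(p2, v)
```

Both functions encode the same physical relation; only the choice of
which quantity is treated as the response differs, which is the point
made by the footnote on the non-uniqueness of block diagrams.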

In general, a system or process is subject to one or more 
inputs (or stimulus or excitation or forcing functions) to 
which it responds by producing one or more outputs (or 
system response). If the observer is unable to act on the sys- 
tem, i.e., change some or any of the inputs, so as to produce 
a desired output, the system is not amenable to control. If 
however, the inputs can be varied, then control is feasible. 
Thus, a control system is defined as an arrangement of phy- 
sical components connected or related in such a manner 
as to command, direct, or regulate itself or another system 
(Stubberud et al. 1994).

Fig. 1.9 Different block diagrams for modeling a pump depending on
how the problem is formulated



One needs to distinguish between open and closed loops, 
and block diagrams provide a convenient way of doing so. 

(a) An open loop control system is one in which the control
action is independent of the output (see Fig. 1.10a). If the
behavior of an open loop system is not completely understood
or if unexpected disturbances act on it, then there may be
considerable and unpredictable variations in the output. Two
important features of such systems are: (i) their ability to
perform accurately is determined by their calibration, i.e., by
how accurately one is able to establish the input-output
relationship; and (ii) they are generally not unstable. A practical
example is an automatic toaster which is simply controlled by a
timer.

(b) A closed loop control system, also referred to as a feedback
control system, is one in which the control action is somehow
dependent on the output (see Fig. 1.10b). If the value
of the response y(t) is too low or too high, then the control
action modifies the manipulated variable (shown as u(t))
appropriately. Such systems are designed to cope with lack of
exact knowledge of system behavior, inaccurate component
models and unexpected disturbances. Thus, increased accuracy
is achieved by reducing the sensitivity of the ratio of output
to input to variations in system characteristics (i.e., increased
bandwidth, defined as the range of variation in the inputs over
which the system will respond satisfactorily) or to random
perturbations of the system by the environment. They
have a serious disadvantage though: they can inadvertently
develop unstable oscillations; this issue is an important one
by itself, and is treated extensively in control textbooks.
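The contrast between the two arrangements can be sketched with a small
simulation. Everything in it is an assumption made for illustration:
the discrete first-order plant, its coefficients, the setpoint, and
the use of a simple integral correction as the feedback law are not
taken from the text.

```python
SETPOINT = 20.0
A_PLANT, B_PLANT = 0.8, 0.2   # hypothetical plant: y[k+1] = a*y[k] + b*u[k] + d


def run_open_loop(disturbance, steps=300):
    """Control action fixed in advance from the calibrated, disturbance-free model."""
    u = SETPOINT * (1.0 - A_PLANT) / B_PLANT
    y = 0.0
    for _ in range(steps):
        y = A_PLANT * y + B_PLANT * u + disturbance
    return y


def run_closed_loop(disturbance, gain=0.5, steps=300):
    """Control action corrected at every step using the measured output."""
    u, y = 0.0, 0.0
    for _ in range(steps):
        y_next = A_PLANT * y + B_PLANT * u + disturbance
        u = u + gain * (SETPOINT - y)   # integral feedback on the error
        y = y_next
    return y


print(run_open_loop(0.0))     # calibration holds: output settles at the setpoint
print(run_open_loop(1.0))     # unexpected disturbance: output settles off target
print(run_closed_loop(1.0))   # feedback removes the offset
```

In this sketch the open loop output is only as good as its calibration,
while the feedback version recovers the setpoint without knowing the
disturbance. Raising the assumed gain above 1 makes the oscillations of
this particular loop grow instead of decay, which is the instability
risk mentioned above.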

Using the same example of a centrifugal pump but going
one step further would lead us to the control of the pump.
For example, if the inlet pressure $p_1$ is specified, and the
pump needs to be operated or controlled (i.e., say by varying
its rotational speed s) under variable outlet pressure $p_2$ so
as to maintain a constant fluid flow rate v, then some sort
of control mechanism or feedback is often used (shown in
Fig. 1.9c). The small circle at the intersection of the signal s
and the feedback represents a summing point which denotes
the algebraic operation being carried out. For example, if the
feedback signal is summed with the signal s, a "+" sign is
placed just outside the summing point. Such graphical
representations are called signal flow diagrams, and are used in
process or system control which requires inverse modeling
and parameter estimation.



1.3 Types of Problems in Mathematical Modeling

1.3.1 Background 

Let us start with explaining the difference between parameters
and variables in a model. A deterministic model is
a mathematical relationship, derived from physical considerations,
between variables and parameters. The quantities
in a model which can be measured independently during an
experiment are the "variables" which can be either input or
output variables (as described earlier). To formulate the
relationship among variables, one usually introduces "constants",
called parameters, which denote inherent properties of nature or
of the engineering system. Sometimes, the distinction
between both is ambiguous and depends on the context, i.e.,
the objective of the study and the manner in which the
experiment is performed. For example, in Eq. 1.1, pipe length has
been taken to be a fixed system parameter since the intention
was to study the pressure drop against fluid velocity. However,
if the objective is to determine the effect of pipe length
on pressure drop for a fixed velocity, the length would then
be viewed as the independent variable.

Fig. 1.10 Open and closed loop systems for a controlled output
y(t). a Open loop, b Closed loop

Consider the dynamic model of a component or system
represented by the block diagram in Fig. 1.11. For simplicity,
let us assume a linear model with no lagged terms in
the forcing variables. Then, the model can be represented in
matrix form as:

$$\mathbf{Y}_t = \mathbf{A}\mathbf{Y}_{t-1} + \mathbf{B}\mathbf{U}_t + \mathbf{C}\mathbf{W}_t \quad \text{with } \mathbf{Y}_1 = \mathbf{d} \qquad (1.10)$$

where the output or state variable at time t is $\mathbf{Y}_t$. The
forcing (or input or exogenous) variables are of two types: vector
U denoting observable and controllable input variables, and
vector W indicating uncontrollable input variables or disturbing
inputs. The parameter vectors of the model are {A, B, C}
while d represents the initial condition vector.

Fig. 1.11 Block diagram of a simple component with parameter
vectors {A, B, C}. Vectors U and W are the controllable/observable and
the uncontrollable/disturbing inputs respectively while Y is the state
variable or system response

As shown in Fig. 1.12, one can differentiate between
two broad types of problems: the forward (or well-defined
or well-specified or direct) problem and the inverse (or
ill-defined or identifiability) problem. The latter can, in turn,
be divided into over-constrained (or over-specified or
under-parameterized) and under-constrained (or under-specified or
over-parameterized) problems which lead to calibration and
model selection⁵ type of problems respectively. Both of these
rely on parameter estimation methods using either calibrated
white box models or grey-box or black-box model forms
regressed to data. These types of problems and their interactions
are discussed at length in Chaps. 10 and 11, while a brief
introduction is provided below.

1.3.2 Forward Problems

Such problems are framed as one where:

given {U, W} and {A, B, C, d}, determine Y   (1.11)
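As a minimal sketch of the forward problem for a linear model of the
kind given in Eq. 1.10, the response is obtained by simply stepping the
recursion (new state = A times previous state + B times controllable
input + C times disturbance) forward from the initial condition. The
matrices, inputs and the indexing convention (d is taken as the state
before the first input is applied) are assumptions made for
illustration only.

```python
def mat_vec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(parts) for parts in zip(*vectors)]


def simulate(A, B, C, d, U, W):
    """Forward problem: known parameters and inputs, unknown response Y."""
    Y = [list(d)]
    for U_t, W_t in zip(U, W):
        Y.append(add(mat_vec(A, Y[-1]), mat_vec(B, U_t), mat_vec(C, W_t)))
    return Y


# Hypothetical two-state system with one controllable input and one disturbance.
A = [[0.5, 0.0], [0.1, 0.8]]
B = [[1.0], [0.0]]
C = [[0.0], [0.5]]
d = [0.0, 0.0]
U = [[1.0], [1.0], [1.0]]
W = [[2.0], [2.0], [2.0]]

for state in simulate(A, B, C, d, U, W):
    print(state)
```

Because every quantity on the right-hand side is known, each step has
exactly one answer, which is why this class of problem is called
well-defined.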



⁵ The term "system identification" is extensively used in numerous
texts related to inverse problems (especially in electrical engineering) 
to denote model structure identification and/or estimating the model 
parameters. Different authors use it differently, and since two distinct 
aspects are involved, this does seem to create some confusion. Hence 
for clarity, this book tries to retain this distinction by explicitly using 
the terms "model selection" for the process of identifying the functional 
form or model structure, and "parameter estimation" for the process of 
identifying the parameters in the functional model.

Fig. 1.12 Different types of mathematical models used in
forward and inverse approaches. The dotted line indicates that
control problems often need model selection and parameter
estimation as a first step

The objective is to predict the response or state variables
of a specified model with known structure and known
parameters when subject to specified input or forcing
variables (Fig. 1.12). This is also referred to as the
"well-defined problem" since it has a unique solution if formulated
properly. This is the type of model which is implicitly studied
in classical mathematics and also in system simulation
design courses. For example, consider a simple steady-state
problem wherein the operating point of a pump and piping
network is represented by black-box models of the pressure
drop ($\Delta p$) and volumetric flow rate ($V$) such as shown in
Fig. 1.13:



$$\Delta p = a_1 + b_1 V + c_1 V^2 \quad \text{for the pump}$$

$$\Delta p = a_2 + b_2 V + c_2 V^2 \quad \text{for the pipe network} \qquad (1.12)$$

Fig. 1.13 Example of a forward problem where solving two
simultaneous equations, one representing the pump curve and the other
the system curve, yields the operating point (x-axis: volume flow
rate V)



Solving the two equations simultaneously yields the
performance conditions of the operating point, i.e., the pressure
drop and flow rate $(\Delta p, V)$ at that point. Note that the
numerical values of the model parameters $\{a_i, b_i, c_i\}$ are known,
and that $\Delta p$ and $V$ are the two variables, while the two
equations provide the two constraints. This simple example has
obvious extensions to the solution of differential equations where
spatial and temporal response is sought.
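A minimal sketch of this forward problem is given below. The
coefficient values are hypothetical and serve only to show the
mechanics: equating the two quadratic curves leaves a single quadratic
in $V$, whose positive root is the operating flow rate.

```python
import math


def operating_point(pump, pipe):
    """Intersect two curves dp = a + b*V + c*V**2 given as (a, b, c) tuples.

    Returns (V, dp) for the smallest positive flow rate at which they meet.
    """
    a = pump[0] - pipe[0]
    b = pump[1] - pipe[1]
    c = pump[2] - pipe[2]
    disc = b * b - 4.0 * c * a
    if c == 0.0 or disc < 0.0:
        raise ValueError("curves do not intersect as two quadratics")
    roots = [(-b + sign * math.sqrt(disc)) / (2.0 * c) for sign in (1.0, -1.0)]
    positive = [v for v in roots if v > 0.0]
    if not positive:
        raise ValueError("no operating point at positive flow")
    flow = min(positive)
    return flow, pump[0] + pump[1] * flow + pump[2] * flow ** 2


# Hypothetical coefficients (a, b, c) for the pump and the pipe network.
pump_curve = (300.0, -5.0, -1.5)
pipe_curve = (50.0, 0.0, 0.5)

flow, dp = operating_point(pump_curve, pipe_curve)
print(flow, dp)
```

With known coefficients the two constraints pin down the two unknowns,
so the answer is unique once the physically meaningless negative root
is discarded.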

In order to ensure accuracy of prediction, the models have
tended to become increasingly complex, especially with the
advent of powerful and inexpensive computing. The
divide and conquer mind-set is prevalent in this approach, of- 
ten with detailed mathematical equations based on scientific 
laws used to model micro-elements of the complete system. 
This approach presumes detailed knowledge of not only the 
various natural phenomena affecting system behavior but 
also of the magnitude of various interactions (for example, 
heat and mass transfer coefficients, friction coefficients, 
etc.). The main advantage of this approach is that the system 
need not be physically built in order to predict its behavior. 
Thus, this approach is ideal in the preliminary design and 
analysis stage and is most often employed as such. Note that 
incorporating superfluous variables and needless modeling 
details does increase computing time and complexity in the 
numerical resolution. However, if done correctly, it does not 
compromise the accuracy of the solution obtained.

1.3.3 Inverse Problems

It is rather difficult to succinctly define inverse problems since
they apply to different classes of problems with applications
in diverse areas, each with their own terminology and
viewpoints (it is no wonder that the field suffers from the "blind
men and the elephant" syndrome). Generally speaking,
inverse problems are those which involve identification of model
structure (system identification) and/or estimates of model
parameters (further discussed in Sect. 1.6 and Chaps. 10
and 11) where the system under study already exists, and one
uses measured or observed system behavior to aid in the model
building and/or refinement. Different model forms may
capture the data trend; this is why some argue that inverse
problems are generally "ill-defined" or "ill-posed".

In terms of mathematical classification⁶, there are three
types of inverse models, all of which require some sort of
identification or estimation (Fig. 1.12):
(a) calibrated forward models, where one uses a mechanistic
model originally developed for the purpose of system
simulation, and modifies or "tunes" the numerous model
parameters so that model predictions match observed
system behavior as closely as possible. Often, only
a sub-set or limited number of measurements of system
states and forcing function values are available, resulting
in a highly over-parameterized problem with more
than one possible solution (discussed in Sect. 11.2).
Such inverse problems can be framed as:

given {Y'', U'', W'', d''}, determine {A'', B'', C''}   (1.13a)

where the '' notation is used to represent limited
measurements or reduced parameter set;
(b) model selection and parameter estimation (using either
grey-box or black-box models), where a suite of plausible
model structures are formulated from basic scientific
and engineering principles involving known influential
and physically-relevant regressors, and experiments are
performed (or system performance data identified) which
allow these competing models to be evaluated and the
"best" model identified. If a grey-box model is used, i.e.,
one which has physical meaning (such as the overall heat
loss coefficient, time constant, ...), it can then serve to
improve our mechanistic understanding of the phenomenon
or system behavior, and provide guidance as to ways
by which the system behavior can be altered in a
pre-specified manner. Different models and parameter estimation
techniques need to be adopted depending on whether:
(i) the intent is to subsequently predict system behavior
within the temporal and/or spatial range of input
variables, in which case simple and well-known methods
such as curve fitting may suffice (see Fig. 1.14); (ii) the
intent is to subsequently predict system behavior outside
the temporal and/or spatial range of input variables, in
which case physically based models are generally required,
and this is influenced by the subsequent application
of the model. Such problems (also referred to as system
identification problems) are examples of under-parameterized
problems and can be framed as:

given {Y, U, W, d}, determine {A, B, C}   (1.13b)

(c) models for system control and diagnostics so as to
identify inputs necessary to produce a pre-specified system
response, and for inferring boundary or initial conditions.
Such problems are framed as:

given {Y''} and {A, B, C}, determine {U, W, d}   (1.13c)

where Y'' is meant to denote that only limited
measurements may be available for the state variable. Such
problems require context-specific approximate numerical
or analytical solutions for linear and non-linear problems
and often involve model selection and parameter
estimation as well. The ill-conditioning, i.e., the fact that
the solution is extremely sensitive to the data (see
Sect. 10.2), is often due to the repetitive nature of the data
collected while the system is under normal operation. There is
a rich and diverse body of knowledge on such inverse
methods, and numerous textbooks, monographs and
research papers are available on this subject. Chapter 11
addresses these problems at more length.

Fig. 1.14 Example of a parameter estimation problem where the model
parameters of a presumed function of pressure drop versus volume flow
rate are identified from discrete experimental data points
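The simplest instance of case (b), the curve fitting situation of
Fig. 1.14, can be sketched as follows. The functional form (pressure
drop equal to a constant plus a term in the square of the flow rate)
and the data points are hypothetical, invented for illustration;
ordinary least squares is used because the presumed model is linear in
its two parameters.

```python
def fit_pressure_drop(flows, pressure_drops):
    """Least-squares estimates (a, c) for the assumed form dp = a + c * V**2."""
    x = [v ** 2 for v in flows]            # regressor: flow rate squared
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(pressure_drops) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, pressure_drops))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    c = sxy / sxx
    a = y_mean - c * x_mean
    return a, c


# Hypothetical measurements scattered around dp = 50 + 0.5 * V**2.
flows = [2.0, 4.0, 6.0, 8.0, 10.0]
pressure_drops = [52.3, 57.8, 68.1, 81.9, 100.0]

a_hat, c_hat = fit_pressure_drop(flows, pressure_drops)
print(a_hat, c_hat)
```

Here the roles are reversed with respect to the forward problem: the
variables are measured and the parameters are the unknowns. Such a fit
is only trustworthy within the range of flow rates covered by the data,
which is the distinction drawn between intents (i) and (ii) above.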



⁶ Several authors define inverse methods as applicable uniquely to case
(c), and simply use the terms calibrated simulation and system identifi- 
cation for the two other cases. 



Example 1.3.1: Simulation of a chiller.

This example will serve to illustrate a simple application of
calibrated simulation, but first, let us discuss the forward
problem. Consider an example of simulating a chilled water
cooling plant consisting of the condenser, compressor and
evaporator, as shown in Fig. 1.15⁷. We shall use rather simple
black-box models for this example for easier comprehension
of the underlying concepts. The steady-state cooling capacity
$q_e$ (in kWt) and the compressor electric power draw $P$ (in
kWe) are functions of the refrigerant evaporator temperature
$t_e$ and the refrigerant condenser temperature $t_c$ in °C, and are
supplied by the equipment manufacturer:

$$q_e = 239.5 + 10.073\,t_e - 0.109\,t_e^2 - 3.41\,t_c - 0.00250\,t_c^2 - 0.2030\,t_e t_c
+ 0.00820\,t_e^2 t_c + 0.0013\,t_e t_c^2 - 0.000080005\,t_e^2 t_c^2 \qquad (1.14)$$

and

$$P = -2.634 - 0.3081\,t_e - 0.00301\,t_e^2 + 1.066\,t_c - 0.00528\,t_c^2 - 0.0011\,t_e t_c
- 0.000306\,t_e^2 t_c + 0.000567\,t_e t_c^2 + 0.0000031\,t_e^2 t_c^2 \qquad (1.15)$$

Fig. 1.15 Schematic of the cooling plant for Example 1.3.1



Further data has been provided:

• water flow rates through the evaporator: $m_e$ = 6.8 kg/s and
in the condenser $m_c$ = 7.6 kg/s

• thermal conductances of the evaporator: $UA_e$ = 30.6 kW/K
and condenser $UA_c$ = 26.5 kW/K

• and the inlet water temperature to the evaporator $t_a$ = 10°C
and that to the condenser $t_b$ = 25°C

Another equation needs to be introduced for the heat
rejected at the condenser $q_c$ (in kWt). This is simply given by a
heat balance of the system (i.e., from the first law of
thermodynamics) as:

$$q_c = q_e + P \qquad (1.16)$$



The forward problem would entail determining the unknown
values of $Y = \{t_e, t_c, q_e, P, q_c\}$. Since there are five
unknowns, five equations are needed. In addition to the three
equations above, two additional ones are needed. These are
the heat balances on the refrigerant side (assumed to be
changing phase, and hence at a constant temperature) and the
coolant water side of both the evaporator and the condenser:

$$q_e = m_e c_p (t_a - t_e)\left[1 - \exp\left(-\frac{UA_e}{m_e c_p}\right)\right] \qquad (1.17)$$

and

$$q_c = m_c c_p (t_c - t_b)\left[1 - \exp\left(-\frac{UA_c}{m_c c_p}\right)\right] \qquad (1.18)$$



where $c_p$ is the specific heat of water = 4.186 kJ/kg K.
Solving the five equations results in:

$t_e$ = 2.84°C, $t_c$ = 34.05°C, $q_e$ = 134.39 kW and $P$ = 28.34 kW
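The solution of the five equations can be sketched numerically as
shown below. Substituting Eq. 1.16 leaves two residual equations in the
two unknown temperatures, which are driven to zero by a Newton
iteration with a finite-difference Jacobian. The solution method, the
starting guess and the function names are choices made for this sketch
and are not prescribed by the text; the polynomial coefficients are
transcribed from Eqs. 1.14 and 1.15 as best they can be read.

```python
import math

CP = 4.186                  # specific heat of water, kJ/(kg K)
M_E, M_C = 6.8, 7.6         # water flow rates, kg/s
UA_E, UA_C = 30.6, 26.5     # thermal conductances, kW/K
T_A, T_B = 10.0, 25.0       # inlet water temperatures, deg C


def cooling_capacity(te, tc):
    """Eq. 1.14: evaporator cooling capacity, kW."""
    return (239.5 + 10.073 * te - 0.109 * te**2 - 3.41 * tc - 0.00250 * tc**2
            - 0.2030 * te * tc + 0.00820 * te**2 * tc + 0.0013 * te * tc**2
            - 0.000080005 * te**2 * tc**2)


def compressor_power(te, tc):
    """Eq. 1.15: compressor electric power, kW."""
    return (-2.634 - 0.3081 * te - 0.00301 * te**2 + 1.066 * tc - 0.00528 * tc**2
            - 0.0011 * te * tc - 0.000306 * te**2 * tc + 0.000567 * te * tc**2
            + 0.0000031 * te**2 * tc**2)


def residuals(te, tc):
    """Mismatch between the equipment models and the two heat balances."""
    qe = cooling_capacity(te, tc)
    qc = qe + compressor_power(te, tc)                                     # Eq. 1.16
    r1 = qe - M_E * CP * (T_A - te) * (1 - math.exp(-UA_E / (M_E * CP)))   # Eq. 1.17
    r2 = qc - M_C * CP * (tc - T_B) * (1 - math.exp(-UA_C / (M_C * CP)))   # Eq. 1.18
    return r1, r2


def solve(te=5.0, tc=35.0, tol=1e-9, max_iter=50, h=1e-6):
    """Newton iteration on (te, tc) with a finite-difference Jacobian."""
    for _ in range(max_iter):
        r1, r2 = residuals(te, tc)
        if abs(r1) + abs(r2) < tol:
            break
        a1, a2 = residuals(te + h, tc)
        b1, b2 = residuals(te, tc + h)
        j11, j21 = (a1 - r1) / h, (a2 - r2) / h
        j12, j22 = (b1 - r1) / h, (b2 - r2) / h
        det = j11 * j22 - j12 * j21
        te -= (r1 * j22 - r2 * j12) / det
        tc -= (j11 * r2 - j21 * r1) / det
    return te, tc


te, tc = solve()
print(te, tc, cooling_capacity(te, tc), compressor_power(te, tc))
```

Under these assumptions the iteration settles at an evaporator
temperature near 2.8°C and a condenser temperature near 34°C, with a
cooling capacity of about 134 kW and a power draw of about 28 kW.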



⁷ From Stoecker (1989) by permission of McGraw-Hill.



To summarize, the performance of the various equipment 
and their interaction have been represented by mathematical 
equations which allow a single solution set to be determined. 
This is the case of the well-defined forward problem adop- 
ted in system simulation and design studies. Let us discuss 
how the same system is also amenable to an inverse model 
approach. Consider the case when a cooling plant similar to 
that assumed above exists, and the facility manager wishes 
to instrument the various components in order to: (i) ver- 
ify that the system is performing adequately, and (ii) vary 
some of the operating variables so that the power consumed 
by the compressor is reduced. In such a case, the numerical
model coefficients given in Eqs. 1.14 and 1.15 will be
unavailable, and so will be the UA values, since either the
manager is unable to find the manufacturer-provided models or the
equipment has degraded to the point that the original models
are no longer accurate. The model calibration will involve
determining these values from experimental data gathered by
appropriately sub-metering the evaporator, condenser and
compressor on both the refrigerant and the water coolant
side. How best to make these measurements, how accurate
the instrumentation should be, what the sampling frequency
should be, and for how long one should monitor are all issues
which fall within the purview of design of field monitoring.
Uncertainty in the measurements, as well as the fact that the
assumed models are approximations of reality, will introduce
model prediction errors, and so the verification of the actual
system against measured performance will have to consider
such aspects properly.

The above example was a simple one with explicit algebraic
equations for each component with no feedback loops.
Detailed simulation programs are much more complex (with
hundreds of variables, complex boundary conditions, ...)
involving ODEs or PDEs; one example is computational fluid
dynamic (CFD) models for indoor air quality studies. Calibrating
such models is extremely difficult given the lack of proper
instrumentation which can provide detailed spatial and temporal
measurement fields, and the inability to conveniently
compartmentalize the problem so that inputs and outputs of sub-blocks
could be framed and calibrated individually as done in the
cooling plant example above. Thus, in view of such limitations
in the data, developing a simpler system model consistent with
the data available, while retaining the underlying mechanistic
considerations as far as possible, is a more appealing approach,
albeit a challenging one; such an approach is shown under
the "model selection" branch in Fig. 1.12.

Example 1.3.2: Dose-response models. An example of how
inverse models differ from a straightforward curve fit is given
below (the same example is treated in much more depth
in Sects. 10.4.4 and 11.3.4). Consider the case of models of
risk to humans when exposed to toxins (or biological poisons)
which are extremely deadly even in small doses. Dose
is the total mass of toxin which the human body ingests.
Response is the measurable physiological change in the body
produced by the toxin which can have many manifestations;
but let us focus on human cells becoming cancerous. Since
different humans (and test animals) react differently to the
same dose, the response is often interpreted as a probability
of cancer being induced, which can be framed as a risk.
Further, tests on lab test animals are usually done at relatively
high levels while policy makers would want to know the
human response under lower levels of dose. Not only does one
have the issue of translating lab specimen results to human
response, but also one needs to be able to extrapolate the
model to low doses. The manner one chooses to extrapolate
the dose-response curve downwards is dependent on either
the assumption one makes regarding the basic process itself
or how one chooses to err (which has policy-making
implications). For example, erring too conservatively in terms of risk
would overstate the risk and prompt implementation of more
precautionary measures, which some critics would fault as
unjustified and improper use of limited resources.

Figure 1.16 illustrates three methods of extrapolating
dose-response curves down to low doses (Heinsohn and
Cimbala 2003). The dots represent observed laboratory tests
performed at high doses. Three types of models are fit to
the data and all of them agree at high doses. However, they
deviate substantially at low doses because the models are
functionally different. While model I is a nonlinear model
applicable to highly toxic agents, curve II is generally taken
to apply to contaminants that are quite harmless at low
doses (i.e., the body is able to metabolize the toxin at low
doses). Curve III is an intermediate one between the other
two curves. The above models are somewhat empirical (or
black-box) and are useful as performance models. However,
they provide little understanding of the basic process itself.
Models based on simplified but phenomenological considerations
of how biological cells become cancerous have also
been developed and these are described in Sect. 11.3.

Fig. 1.16 Three different inverse models depending on toxin type for
extrapolating dose-response observations at high doses to the response
at low doses (x-axis: dose). (From Heinsohn and Cimbala (2003) by
permission of CRC Press)
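How strongly the choice of functional form drives a low-dose
extrapolation can be sketched with three stand-in curves. The forms
below (a square-root law, a straight line and a threshold line) are
hypothetical and are not the models of Fig. 1.16; they are merely
scaled so that all three coincide at the highest tested dose, taken
here as 1 in normalized units.

```python
def curve_I(dose):
    """Hypothetical stand-in for a highly toxic agent (steep rise at low dose)."""
    return dose ** 0.5


def curve_III(dose):
    """Hypothetical intermediate stand-in (response proportional to dose)."""
    return dose


def curve_II(dose, threshold=0.2):
    """Hypothetical stand-in with no response below an assumed threshold dose."""
    return max(0.0, (dose - threshold) / (1.0 - threshold))


for dose in (1.0, 0.5, 0.1, 0.01):
    print(dose, curve_I(dose), curve_III(dose), curve_II(dose))
```

All three stand-ins return the same response at the highest dose, yet
at one hundredth of that dose they range from a tenth of the high-dose
response down to none at all. Data collected only at high doses cannot
discriminate between them, which is why the extrapolation rests on an
assumption about the underlying process.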

There are several aspects to this problem relevant to
inverse modeling: (i) Can the observed data of dose versus
response provide some insights into the process which induces
cancer in biological cells? (ii) How valid are these results
when extrapolated down to low doses? (iii) Since laboratory tests
are performed on animal subjects, how valid are these results
when extrapolated to humans? There are no simple answers
to these queries (until the basic process itself is completely
understood). Probability is bound to play an important role
given the nature of the process; hence the adoption by
various agencies (such as the U.S. Environmental Protection
Agency) of probabilistic methods towards risk assessment
and modeling.



1.4 What is Data Analysis?

In view of the diversity of fields to which data analysis is 
applied, an all-encompassing definition would have to be 
general. One good definition is: "an evaluation of collected 
observations so as to extract information useful for a speci- 
fic purpose". The evaluation relies on different mathematical 
and statistical tools depending on the intent of the investi- 
gation. In the area of science, the systematic organization 
of observational data, such as the orbital movement of the 
planets, provided a means for Newton to develop his laws of
motion. Observational data from deep space allow scientists
to develop/refine/verify theories and hypotheses about the
structure, relationships, origins, and presence of certain
phenomena (such as black holes) in the cosmos. At the other end
of the spectrum, data analysis can also be viewed as simply:
"the process of systematically applying statistical and logical
techniques to describe, summarize, and compare data".
From the perspective of an engineer/scientist, data analysis
is a process which, when applied to system performance data
collected either intrusively or non-intrusively, allows certain
conclusions about the state of the system to be drawn, and
thereby, follow-up actions to be initiated.

Studying a problem through the use of statistical data ana- 
lysis usually involves four basic steps (Arsham 2008): 

(a) Defining the Problem: The context of the problem and
the exact definition of the problem being studied need to be 
framed. This allows one to design both the data collection 
system and the subsequent analysis procedures to be follo- 
wed. 

(b) Collecting the Data: In the past (say, 50 years back), 
collecting the data was the most difficult part, and was often 
the bottleneck of data analysis. Nowadays, one is overwhel- 
med by the large amounts of data resulting from the great 
strides in sensor and data collection technology; and data 
cleaning, handling, summarizing have become major issues. 
Paradoxically, the design of data collection systems has been 
marginalized "by an apparent belief that extensive computation
can make up for any deficiencies in the design of data
collection". Gathering data without a clear definition of the
problem often results in failure or limited success. Data can 
be collected from existing sources or obtained through ob- 
servation and experimental studies designed to obtain new 
data. In an experimental study, the variable of interest is 
identified. Then, one or more factors in the study are con- 
trolled so that data can be obtained about how the factors 
influence the variables. In observational studies, no attempt 
is made to control or influence the variables of interest either 
intentionally or due to the inability to do so (two examples 
are surveys and astronomical data). 

(c) Analyzing the Data: There are various statistical and 
analysis approaches and tools which one can bring to bear 
depending on the type and complexity of the problem and 
the type, quality and completeness of the data available. Sec- 
tion 1.6 describes several categories of problems encoun- 
tered in data analysis. Probability is an important aspect of 
data analysis since it provides a mechanism for measuring, 
expressing, and analyzing the uncertainties associated with 
collected data and mathematical models used. This, in turn, 
impacts the confidence in our analysis results: uncertainty in
future system performance predictions, confidence level in
our confirmatory conclusions, uncertainty in the validity of
the action proposed, and so on. The majority of the topics
addressed in this book pertain to this category.

(d) Reporting the Results: The final step in any data analy- 
sis effort involves preparing a report. This is the written do- 
cument that logically describes all the pertinent stages of the 
work, presents the data collected, discusses the analysis re- 
sults, states the conclusions reached, and recommends further 
action specific to the issues of the problem identified at the 
onset. The final report and any technical papers resulting from 
it are the only documents which survive over time and are 
invaluable to other professionals. Unfortunately, the task of 
reporting is often cursory and not given its due importance. 

Recently, the term "intelligent" data analysis has been 
used which has a different connotation from traditional ones 
(Berthold and Hand 2003). This term is used not in the sense 
that it involves added intelligence of the user or analyst in 
applying traditional tools, but that the statistical tools them- 
selves have some measure of intelligence built into them. A 
simple example is when a regression model has to be identi- 
fied from data. The tool evaluates hundreds of built-in functi- 
ons and presents to the user a prioritized list of models accor- 
ding to their goodness-of-fit. The recent evolution of com- 
puter-intensive methods (such as bootstrapping and Monte 
Carlo methods) along with soft computing algorithms (such 
as artificial neural networks, genetic algorithms,. . .) enhance 
the capability of traditional statistics, model estimation, and 
data analysis methods. These added capabilities of enhanced 
computational power of modern-day computers and the so- 
phisticated manner in which the software programs are writ- 
ten allow "intelligent" data analysis to be performed. 
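
The regression example above can be sketched in a few lines. The code below is only an illustration of the idea, not the procedure of any particular software tool: the candidate list `CANDIDATES`, the helper names and the synthetic data are all assumptions made for this sketch. Each candidate is a one-regressor form that is linear in its parameters, so a single least-squares routine can fit all of them, and the models are then ranked by their coefficient of determination.

```python
import math

def fit_line(u, v):
    """Ordinary least squares for v = a + b*u; returns (a, b)."""
    n = len(u)
    u_mean = sum(u) / n
    v_mean = sum(v) / n
    sxx = sum((ui - u_mean) ** 2 for ui in u)
    sxy = sum((ui - u_mean) * (vi - v_mean) for ui, vi in zip(u, v))
    b = sxy / sxx
    return v_mean - b * u_mean, b

def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat against data y."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical library of candidate forms: each maps x to a transformed regressor
CANDIDATES = {
    "y = a + b*x": lambda x: x,
    "y = a + b*x^2": lambda x: x ** 2,
    "y = a + b*sqrt(x)": math.sqrt,
    "y = a + b*ln(x)": math.log,
    "y = a + b/x": lambda x: 1.0 / x,
}

def rank_models(x, y):
    """Fit every candidate and return (R^2, name, a, b) sorted best first."""
    results = []
    for name, transform in CANDIDATES.items():
        u = [transform(xi) for xi in x]
        a, b = fit_line(u, y)
        y_hat = [a + b * ui for ui in u]
        results.append((r_squared(y, y_hat), name, a, b))
    return sorted(results, reverse=True)

# Synthetic, noise-free data generated from y = 2 + 3*x^2
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0 + 3.0 * xi ** 2 for xi in x]
ranking = rank_models(x, y)
for r2, name, a, b in ranking:
    print(f"{name:20s} R^2 = {r2:.4f}  a = {a:.3f}  b = {b:.3f}")
```

Ranking by goodness-of-fit alone favors flexible forms, which is why the analyst still has to judge whether the top-ranked model is physically sensible.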



1.5 Types of Uncertainty in Data

If the same results are obtained when an experiment is repea- 
ted under the same conditions, one says that the experiment 
is deterministic. It is this deterministic nature of science that 
allows theories or models to be formulated and permits the 
use of scientific theory for prediction (Hodges and Lehman 
1970). However, all observational or experimental data in- 
variably have a certain amount of inherent noise or random- 
ness which introduces a certain degree of uncertainty in the 
results or conclusions. Due to instrument or measurement 
technique, or improper understanding of all influential fac- 
tors, or the inability to measure some of the driving para- 
meters, random and/or bias types of errors usually infect 
the deterministic data. However, there are also experiments 
whose results vary due to the very nature of the experiment; 
for example gambling outcomes (throwing of dice, card ga- 
mes,. . .). These are called random experiments. Without un- 
certainty or randomness, there would have been little need 
for statistics. Probability theory and inferential statistics
have been developed to deal with random experiments and 
the same approach has also been adapted to deterministic 
experimental data analysis. Both inferential statistics and 
stochastic model building have to deal with the random na- 
ture of observational or experimental data, and thus, require 
knowledge of probability. 

There are several types of uncertainty in data, and all of 
them have to do with the inability to "determine the true state 
of affairs of a system" (Haimes 1998). A succinct classifica- 
tion involves the following sources of uncertainties: 

(a) purely stochastic variability (or aleatory uncertainty) 
where the ambiguity in outcome is inherent in the na- 
ture of the process, and no amount of additional measu- 
rements can reduce the inherent randomness. Common 
examples involve coin tossing, or card games. These 
processes are inherently random (either on a temporal
or spatial basis), and their outcome, while uncertain,
can be anticipated on a statistical basis;

(b) epistemic uncertainty, or ignorance or lack of complete
knowledge of the process, which results in certain influential
variables not being considered (and, thus, not measured);

(c) inaccurate measurement of numerical data due to in- 
strument or sampling errors; 

(d) cognitive vagueness involving human linguistic de- 
scription. For example, people use words like tall/short 
or very important/not important which cannot be quan- 
tified exactly. This type of uncertainty is generally as- 
sociated with qualitative and ordinal data where subjec- 
tive elements come into play. 

The traditional approach is to use probability theory along 
with statistical techniques to address (a), (b), and (c) types of 
uncertainties. The variability due to sources (b) and (c) can 
be diminished by taking additional measurements, by using 
more accurate instrumentation, by better experimental de- 
sign and acquiring better insight into specific behavior with 
which to develop more accurate models. Several authors ap- 
ply the term "uncertainty" to only these two sources. Final- 
ly, source (d) can be modeled using probability approaches 
though some authors argue that it would be more convenient 
to use fuzzy logic to model this vagueness in speech. 
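
The contrast between source (a) and the random part of source (c) can be illustrated numerically. The sketch below is a simulation under assumed values (a true temperature of 60°C, a sensor error with standard deviation 0.5°C, and Gaussian noise); none of these numbers come from the text. Averaging more readings of a fixed quantity narrows the spread of the averaged result, whereas rolling a die more often does not change the spread of the individual outcomes. Note that averaging reduces only the random part of the measurement error; a bias would require better instrumentation or calibration.

```python
import random
import statistics

random.seed(1)            # fixed seed so that the sketch is repeatable
TRUE_VALUE = 60.0         # hypothetical true temperature, degC
SENSOR_SIGMA = 0.5        # hypothetical random sensor error (std. dev.), degC

def average_reading(n):
    """Average of n noisy readings of the same fixed quantity."""
    return statistics.fmean([random.gauss(TRUE_VALUE, SENSOR_SIGMA) for _ in range(n)])

def spread_of_average(n, campaigns=2000):
    """Standard deviation of the n-reading average over repeated campaigns."""
    return statistics.stdev([average_reading(n) for _ in range(campaigns)])

def spread_of_die(n_rolls):
    """Standard deviation of individual die outcomes over n_rolls rolls."""
    return statistics.pstdev([random.randint(1, 6) for _ in range(n_rolls)])

# Measurement error: the spread of the average shrinks roughly as 1/sqrt(n)
for n in (1, 4, 16, 64):
    print("readings averaged:", n, " spread of average:", round(spread_of_average(n), 3))

# Aleatory variability: more rolls do not make a single roll less random
for n in (100, 10000):
    print("rolls:", n, " spread of outcomes:", round(spread_of_die(n), 3))
```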



1.6 Types of Applied Data Analysis
and Modeling Methods

Such methods can be separated into the following groups de- 
pending on the intent of the analysis: 

(a) Exploratory data analysis and descriptive statistics, 
which entails performing "numerical detective work" 
on the data and developing methods for screening, or- 
ganizing, summarizing and detecting basic trends in the 
data (such as graphs, and tables) which would help in 



information gathering and knowledge generation. His- 
torically, formal statisticians have shied away from ex- 
ploratory data analysis considering it to be either too 
simple to warrant serious discussion or too ad hoc in 
nature to be able to expound logical steps (McNeil 
1977). This area had to await the pioneering work by 
John Tukey and others to obtain a formal structure. This 
area is not specifically addressed in this book, and the 
interested reader can refer to Hoaglin et al. (1983) or
Tukey (1988) for an excellent perspective. 

(b) Model building and point estimation which involves 
(i) taking measurements of the various parameters (or 
regressor variables) affecting the output (or response 
variables) of a device or a phenomenon, (ii) identify- 
ing a causal quantitative correlation between them by 
regression, and (iii) using it to make predictions about 
system behavior under future operating conditions. 
There is a rich literature in this area with great diversity 
of techniques and level of sophistication. 

(c) Inferential problems are those which involve making 
uncertainty inferences or calculating uncertainty or con- 
fidence intervals of population estimates from selected 
samples. They also apply to regression, i.e., uncertainty 
in model parameters, and in model predictions. When a 
regression model is identified from data, the data cannot 
be considered to include the entire "population" data, i.e., 
all the observations one could possibly conceive. Hence, 
model parameters and model predictions suffer from 
uncertainty which needs to be quantified. This takes the 
form of assigning uncertainty bands around the estima- 
tes. Those methods which allow tighter predictions are 
deemed more "efficient", and hence more desirable. 

(d) Design of experiments is the process of prescribing the 
exact manner in which samples for testing need to be 
selected, and the conditions and sequence under which 
the testing needs to be performed such that the relati- 
onship or model between a response variable and a set 
of regressor variables can be identified in a robust and 
accurate manner. 

(e) Classification and clustering problems: Classification 
problems are those where one would like to develop a 
model to statistically distinguish or "discriminate" dif- 
ferences between two or more groups when one knows 
beforehand that such groupings exist in the data set pro- 
vided, and, to subsequently assign, allocate or classify 
a future unclassified observation into a specific group 
with the smallest probability of error. Clustering, on 
the other hand, is a more difficult problem, involving 
situations when the number of clusters or groups is not 
known beforehand, and the intent is to allocate a set of 
observation sets into groups which are similar or "clo- 
se" to one another with respect to certain attribute(s) or 
characteristic(s). 

(f) Time series analysis and signal processing. Time series
analysis involves the use of a set of tools that include 
traditional model building techniques as well as those 
involving the sequential behavior of the data and its 
noise. They involve the analysis, interpretation and ma- 
nipulation of time series signals in either time domain 
or frequency domain. Signal processing is one speci- 
fic, but important, sub-domain of time series analysis 
dealing with sound, images, biological signals such as 
ECG, radar signals, and many others. Vibration analysis 
of rotating machinery is another example where signal 
processing tools can be used. 

(g) Inverse modeling (introduced earlier in Sect. 1.3.3) is an 
approach to data analysis methods which includes three 
classes: statistical calibration of mechanistic models, 
model selection and parameter estimation, and inferring
forcing functions and boundary/initial conditions. It
combines the basic physics of the process with statistical 
methods so as to achieve a better understanding of the 
system dynamics, and thereby use it to predict system 
performance either within or outside the temporal and/or 
spatial range used to develop the model. The discipline 
of inverse modeling has acquired a very important niche 
not only in the fields of engineering and science but in 
other disciplines as well (such as biology, medicine,. . .). 

(h) Risk analysis and decision making: Analysis is often a 
precursor to decision-making in the real world. Along 
with engineering analysis there are other aspects such 
as making simplifying assumptions, extrapolations into 
the future, financial ambiguity,... that come into play 
while making decisions. Decision theory is the study 
of methods for arriving at "rational" decisions under 
uncertainty. The decisions themselves may or may not 
prove to be correct in the long term, but the process 
provides a structure for the overall methodology by 
which undesirable events are framed as risks, the chain 
of events simplified and modeled, trade-offs between 
competing alternatives assessed, and the risk attitude of 
the decision-maker captured (Clemen and Reilly 2001). 
The value of collecting additional information to redu- 
ce the risk, capturing heuristic knowledge or combining 
subjective preferences into the mathematical structure 
are additional aspects of such problems. As stated ear- 
lier, inverse models can be used to make predictions ab- 
out system behavior. These have inherent uncertainties 
(which may be large or small depending on the problem 
at hand), and adopting a certain inverse model over 
potential competing ones involves the consideration of 
risk analysis and decision making tools. 
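
Items (b) and (c) above can be illustrated together with a minimal sketch: a straight line is fitted by ordinary least squares (model building and point estimation), and standard errors and an approximate uncertainty band are then attached to the estimates (inference). The data are synthetic and the function names are hypothetical. The band uses the normal quantile 1.96 as a large-sample approximation; for a sample this small, a Student-t quantile would normally be used instead.

```python
import math
import statistics

def ols_with_uncertainty(x, y):
    """Fit y = a + b*x by OLS; return estimates and their standard errors."""
    n = len(x)
    x_mean = statistics.fmean(x)
    y_mean = statistics.fmean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_mean - b * x_mean
    # Residual variance with n - 2 degrees of freedom (two fitted parameters)
    s2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    se_b = math.sqrt(s2 / sxx)
    se_a = math.sqrt(s2 * (1.0 / n + x_mean ** 2 / sxx))
    return {"a": a, "b": b, "se_a": se_a, "se_b": se_b,
            "s2": s2, "n": n, "x_mean": x_mean, "sxx": sxx}

def mean_response_band(fit, x0, z=1.96):
    """Approximate 95% band for the mean response at x0 (normal quantile z)."""
    y0 = fit["a"] + fit["b"] * x0
    se = math.sqrt(fit["s2"] * (1.0 / fit["n"]
                                + (x0 - fit["x_mean"]) ** 2 / fit["sxx"]))
    return y0 - z * se, y0 + z * se

# Synthetic data: true model y = 2 + 0.5*x with a small alternating disturbance
x = [float(i) for i in range(1, 11)]
y = [2.0 + 0.5 * xi + (0.1 if i % 2 == 0 else -0.1) for i, xi in enumerate(x)]

fit = ols_with_uncertainty(x, y)
inside = mean_response_band(fit, 5.5)    # at the center of the data
outside = mean_response_band(fit, 20.0)  # extrapolation beyond the data
print("slope:", round(fit["b"], 4), "+/-", round(fit["se_b"], 4))
print("band width at x = 5.5:", round(inside[1] - inside[0], 4))
print("band width at x = 20:", round(outside[1] - outside[0], 4))
```

The band widens as the prediction point moves away from the mean of the regressor data, which is one reason extrapolated predictions deserve more caution than interpolated ones.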



1.7 Example of a Data Collection
and Analysis System

Data can be separated into experimental or observational de- 
pending on whether the system operation can be modified 
by the observer or not. Consider a system where the initial 
phase of designing and installing the monitoring system is 
complete. Figure 1.17 is a flowchart depicting various stages 
in the collection, analysis and interpretation of data collected 
from an engineering thermal* system while in operation. The 
various elements involved are: 

(a) a measurement system consisting of various sensors of 
pre-specified types and accuracy. The proper location, 
commissioning and maintenance of these sensors are 
important aspects of this element; 

(b) data sampling element whereby the output of the va- 
rious sensors are read at a pre-determined frequency. 
The low cost of automated data collection has led to 
increasingly higher sampling rates. Typical frequencies 
for thermal systems are in the range of 1 s to 1 min;

(c) clean raw data for spikes, gross errors, mis-recordings, 
and missing or dead channels, average (or sum) the data 
samples and, if necessary, store them in a dynamic fa- 
shion (i.e., online) in a central electronic database with 
an electronic time stamp; 

(d) average raw data and store in a database; typical periods 
are in the range of 1-30 min. One can also include some 
finer checks for data quality by flagging data when they 
exceed physically stipulated ranges. This process need 
not be done online but could be initiated automatically 
and periodically, say, every day. It is this data set which 
is queried as necessary for subsequent analysis; 

(e) The above steps in the data collection process are per- 
formed on a routine basis. This data can be used to ad- 
vantage, provided one can frame the issues relevant to 
the client and determine which of these can be satisfied. 
Examples of such routine uses are assessing overall 
time-averaged system efficiencies and preparing weekly 
performance reports, as well as for subtler action such 
as supervisory control and automated fault detection; 

(f) Occasionally the owner would like to evaluate major 
changes such as equipment change out or addition of 
new equipment, or would like to improve overall sys- 
tem performance or reliability not knowing exactly how 
to achieve this. Alternatively, one may wish to evaluate 
system performance under an exceptionally hot spell 
of several days. This is when specialized consultants 
are brought in to make recommendations to the owner. 
Historically, such analyses were done based on the professional
expertise of the consultant with minimal or no measurements of
the actual system. However, both financial institutions who would
lend the money for implementing these changes and the upper
management of the company owning the system are insisting on a
more transparent engineering analysis based on actual data.
Hence, the preliminary steps involving relevant data extraction
and a more careful data proofing and validation are essential;

* Electrical systems have different considerations since they mostly use
very high frequency sampling rates.

Fig. 1.17 Flowchart depicting various stages in data analysis
and decision making as applied to continuous monitoring of
thermal systems

(g) Extracted data are then subjected to certain engineering
analyses which can be collectively referred to as data-driven
modeling and analysis. These involve statistical inference,
identifying patterns in the data, regression analysis, parameter
estimation, performance extrapolation, classification or
clustering, deterministic modeling, etc.;

(h) Performing a decision analysis, in our context, involves
using the results of the engineering analyses and adding an
additional layer of analyses that includes modeling uncertainties
(involving, among other issues, a sensitivity analysis), modeling
stakeholder preferences and structuring decisions. Several
iterations may be necessary between this element and the ones
involving engineering analysis and data extraction;






(i) the various choices suggested by the decision analysis 
are presented to the owner or decision-maker so that a 
final course of action may be determined. Sometimes, 
it may be necessary to perform additional analyses or 
even modify or enhance the capabilities of the measure- 
ment system in order to satisfy client needs. 
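
Elements (c) and (d) above can be sketched as a small cleaning and averaging routine. This is a minimal illustration under stated assumptions: the physical limits `T_MIN` and `T_MAX`, the rule that at least half the samples in a block must be valid, and the sample readings are all hypothetical choices, not values prescribed in the text. Missing channels are represented by `None`.

```python
from statistics import fmean

# Hypothetical physical limits for a water temperature channel, degC
T_MIN, T_MAX = 0.0, 100.0

def clean(samples):
    """Replace missing (None) or physically impossible readings with None."""
    return [v if v is not None and T_MIN <= v <= T_MAX else None
            for v in samples]

def block_average(samples, block_size, min_valid_fraction=0.5):
    """Average consecutive blocks of samples; return (mean, flag) per block."""
    out = []
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        valid = [v for v in block if v is not None]
        if len(valid) >= min_valid_fraction * len(block):
            out.append((fmean(valid), "ok"))
        else:
            # Too few valid samples: store no average and flag the period
            out.append((None, "insufficient data"))
    return out

# Hypothetical raw channel with a spike (999.0), dead samples and a gross error
raw = [60.1, 60.3, 999.0, 60.2, None, 60.4, None, None, -40.0, 60.0, 59.9, 60.1]
averaged = block_average(clean(raw), block_size=4)
for mean, flag in averaged:
    print(mean, flag)
```

Storing the flag alongside each averaged value keeps the later analysis stages from silently using periods in which the sensor was not reporting properly.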



1.8 Decision Analysis and Data Mining

The primary objective of this book is to address element (g) 
and to some extent element (h) described in the previous sec- 
tion. However, data analysis is not performed just for its own 
sake; its usefulness lies in the support it provides to such 
objectives as gaining insight about system behavior which 
was previously unknown, characterizing current system per- 
formance against a baseline, deciding whether retrofits and 
suggested operational changes to the system are warranted or 
not, quantifying the uncertainty in predicting future behavior 
of the present system, suggesting robust/cost effective/risk 
averse ways to operate an existing system, avoiding catastrophic
system failure, etc.

There are two disciplines with overlapping/complementa- 
ry aims to that of data analysis and modeling which are di- 
scussed briefly so as to provide a broad contextual basis to the 
reader. The first deals with decision analysis stated under ele- 
ment (h) above whose objective is to provide both an overall 
paradigm and a set of tools with which decision makers can 
construct and analyze a model of a decision situation (Cle- 
men and Reilly 2001). Thus, though it does not give speci- 
fic answers to problems faced by a person, decision analysis 
provides a structure, guidance and analytical tools on how to 
logically and systematically tackle a problem, model uncer- 
tainty in different ways, and hopefully arrive at rational deci- 
sions in tune with the personal preferences of the individual 
who has to live with the choice(s) made. While it is applicable 
to problems without uncertainty but with multiple outcomes, 
its strength lies in being able to analyze complex multiple 
outcome problems that are inherently uncertain or stochastic 
compounded with the utility functions or risk preferences of 
the decision-maker. There are different sources of uncertainty 
in a decision process but the one pertinent to data modeling 
and analysis in the context of this book is that associated with 
fairly well behaved and well understood engineering systems 
with relatively low uncertainty in their performance data. 
This is the reason why historically, engineering students were 
not subjected to a class in decision analysis. However, many 
engineering systems are operated wherein the attitudes and 
behavior of people operating these systems assume importan- 
ce; in such cases, there is a need to adapt many of the decision 
analysis tools and concepts with traditional data analysis and 
modeling techniques. This issue is addressed in Chap. 12. 
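
How the risk preference of the decision-maker can change the preferred alternative is shown in the sketch below. Everything in it is an assumption made for illustration: the two alternatives, their probabilities and payoffs, and the choice of an exponential utility function with a risk tolerance R (one common way of representing risk aversion, not necessarily the one adopted later in this book).

```python
import math

# Hypothetical retrofit decision: (probability, net saving in k$) per outcome
ALTERNATIVES = {
    "keep existing chiller": [(1.0, 20.0)],
    "install new chiller": [(0.6, 80.0), (0.4, -40.0)],
}

def expected_value(outcomes):
    """Probability-weighted payoff (risk-neutral criterion)."""
    return sum(p * x for p, x in outcomes)

def expected_utility(outcomes, risk_tolerance):
    """Expected exponential utility u(x) = 1 - exp(-x / R).

    A smaller risk tolerance R represents a more risk-averse decision-maker.
    """
    return sum(p * (1.0 - math.exp(-x / risk_tolerance)) for p, x in outcomes)

def best_choice(score):
    """Name of the alternative that maximizes the given scoring function."""
    return max(ALTERNATIVES, key=lambda name: score(ALTERNATIVES[name]))

risk_neutral = best_choice(expected_value)
risk_averse = best_choice(lambda outcomes: expected_utility(outcomes, 30.0))
print("risk-neutral choice:", risk_neutral)
print("risk-averse choice: ", risk_averse)
```

With these assumed numbers the uncertain alternative has the higher expected value (32 against 20), yet a sufficiently risk-averse decision-maker prefers the certain one because the possible loss weighs heavily in the utility function.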

The second discipline is data mining which is defined as 
the science of extracting useful information from large/enormous
data sets. Though it is based on a range of techniques,
from the very simple to the sophisticated (involving such 
methods as clustering techniques, artificial neural networks, 
genetic algorithms,. . .), it has the distinguishing feature that it 
is concerned with sifting through large/enormous amounts
of data with no clear aim in mind except to discern hidden 
information, discover patterns and trends, or summarize data 
behavior (Dunham 2003). Thus, not only does its distincti- 
veness lie in the data management problems associated with 
storing and retrieving large amounts of data from perhaps 
multiple datasets, but also in it being much more explorato- 
ry and less formalized in nature than is statistics and model 
building where one analyzes a relatively small data set with 
some specific objective in mind. Data mining has borrowed 
concepts from several fields such as multivariate statistics 
and Bayesian theory, as well as less formalized ones such as 
machine learning, artificial intelligence, pattern recognition, 
and data management so as to bound its own area of study 
and define the specific elements and tools involved. It is the 
result of the digital age where enormous digital databases 
abound from the mundane (supermarket transactions, credit 
cards records, telephone calls, internet postings,...) to the 
very scientific (astronomical data, medical images,. . .). Thus, 
the purview of data mining is to explore such data bases in 
order to find patterns or characteristics (called data discove- 
ry) or even in response to some very general research ques- 
tion not provided by any previous mechanistic understanding 
of the social or engineering system, so that some action can 
be taken resulting in a benefit or value to the owner. Data 
mining techniques are not discussed in this book except for 
those data analysis and modeling issues which are common 
to both disciplines. 



1.9 Structure of Book

The overall structure of the book is depicted in Table 1.3 
along with a simple suggestion as to how this book could 
be used for two courses if necessary. This chapter (Chap. 1) 
has provided a general introduction of mathematical models, 
and discussed the different types of problems and analysis 
tools available for data driven modeling and analysis. Chap- 
ter 2 reviews basic probability concepts (both classical and 
Bayesian), and covers various important probability distri- 
butions with emphasis as to their practical usefulness. Chap- 
ter 3 reviews rather basic material involving data collection, 
and preliminary tests within the purview of data validation. 
It also presents various statistical measures and graphical 
plots used to describe and scrutinize the data, data errors and 
their propagation. Chapter 4 covers statistical inference such 
as hypotheses testing, and ANOVA, as well as non-parame- 
tric tests and sampling and re-sampling methods. A brief 
treatment of Bayesian inference is also provided. Parameter
estimation using ordinary least squares (OLS) involving single
and multi-linear regression is treated in Chap. 5. Residual
analysis, detection of leverage and influential points
are also discussed. The material from all these four chapters
(Chaps. 2-5) is generally covered in undergraduate statistics
and probability classes, and is meant as review or refresher
material (especially useful to the general practitioner).
Numerous practically-framed examples and problems along
with real-world case study examples using actual monitored
data are assembled pertinent to energy and environmental issues
and equipment (such as solar collectors, pumps, fans,
heat exchangers, chillers, etc.). Chapter 6 covers basic classical
concepts of experimental design methods, and discusses factorial
and response surface methods which allow extending
hypothesis testing to multiple variables as well as identifying
sound performance models.

Table 1.3 Analysis methods covered in this book and suggested
curriculum for two courses

Chapter  Topic                                                             First course  Second course
1        Introduction: Mathematical models and data-driven methods         X             X
2        Probability and statistics, important probability distributions   X
3        Exploratory data analysis and descriptive statistics              X
4        Inferential statistics, non-parametric tests and sampling         X
5        OLS regression, residual analysis, point and interval estimation  X
6        Design of experiments                                             X
7        Traditional optimization methods and dynamic programming                        X
8        Classification and clustering analysis                                          X
9        Time series analysis, ARIMA, process monitoring and control                     X
10       Parameter estimation methods                                                    X
11       Inverse methods (calibration, system identification, control)                   X
12       Decision-making and risk analysis                                               X

Chapter 7 covers traditional optimization methods inclu- 
ding dynamic optimization methods. Chapter 8 discusses 
the basic concepts and some of the analysis methods which 
allow classification and clustering tasks to be performed. 
Chapter 9 introduces several methods to smooth time series
data, to analyze time series data in the time domain, and to
develop forecasting models using both the OLS modeling approach
and the ARMA class of models. An overview is also provided
of control chart techniques extensively used for process
control and condition monitoring. Chapter 10 discusses
subtler aspects of parameter estimation such as maximum
likelihood estimation, recursive and weighted least squares,
robust-fitting techniques, dealing with collinear regressors
and error-in-x models. Computer intensive methods such as
bootstrapping are also covered. Chapter 11 presents an overview
of the types of problems which fall under inverse modeling:
control problems which include inferring inputs and
boundary conditions, calibration of white box models and 
complex linked models requiring computer programs, and 
system identification using black-box (such as neural net- 
works) and grey-box models (state-space formulation). Illus- 
trative examples are provided in each of these cases. Finally, 
Chap. 12 covers basic notions relevant to and involving the 
disciplines of risk analysis and decision-making, and reinforces
these by way of examples. It also describes various facets
such as framing undesirable events as risks, simplifying and
modeling chain of events, assessing trade-offs between com- 
peting alternatives, and capturing the risk attitude of the de- 
cision-maker. The value of collecting additional information 
to reduce the risk is also addressed. 



Problems 

Pr. 1.1 Identify which of the following functions are linear 
models, which are linear in their parameters (a, b, c) and 
which are both: 

(a) $y = a + bx + cx^2$
(b) $y = a + \frac{b}{x} + \frac{c}{x^2}$
(c) $y = a + b(x - 1) + c(x - 1)^2$
(d) $y = (a_0 + b_0 x_1 + c_0 x_1^2) + (a_1 + b_1 x_1 + c_1 x_1^2)\,x_2$
(e) $y = a + b \sin(c + x)$
(f) $y = a + b \sin(cx)$
(g) $y = a + b x^c$
(h) $y = a + b x^{1.5}$
(i) $y = a + b e^{cx}$



Pr. 1.2 Recast Eq. 1.1 such that it expresses the fluid volume 
flow rate (rather than velocity) in terms of pressure drop and 
other quantities. Draw a block diagram to represent the case 
when a feedback control is used to control the flow rate from 
measured pressure drop. 

Pr. 1.3 Consider Eq. 1.4 which is a lumped model of a fully-
mixed hot water storage tank. Assume the initial tank temperature
is 60°C while the ambient temperature is constant at 20°C.

(i) Deduce the expression for the time constant of the tank
in terms of model parameters.

(ii) Compute its numerical value when $Mc_p$ = 9.0 MJ/°C and
$UA$ = 0.833 kW/°C.

(iii) What will be the storage tank temperature after 6 h under
cool-down?

(iv) How long will the tank temperature take to drop to
40°C?

(v) Derive the solution for the transient response of the storage
tank under electric power input P.

(vi) If P = 50 kW, calculate and plot the response when the
tank is initially at 30°C (akin to Fig. 1.7).

Pr. 1.4 The first order model of a measurement system is
given by Eq. 1.8. Its solution for a step change in the variable
being measured results in Eq. 1.9 which is plotted in Fig. 1.7.
Derive an analogous model and plot the behavior for a steady
sinusoidal variation in the input quantity:

$q(t) = A \sin(wt)$ where A is the amplitude and w the frequency.

Pr. 1.5 Consider Fig. 1.4 where a heated sphere is being
cooled. The analysis simplifies considerably if the sphere
can be modeled as a lumped one. This can be done if the
Biot number $Bi = \frac{h L_e}{k} < 0.1$. Assume that the external
heat transfer coefficient is 10 W/m² °C and that the radius of
the sphere is 15 cm. The equivalent length of the sphere is
$L_e = \frac{\text{Volume}}{\text{Surface area}}$. Determine whether the lumped model
assumption is appropriate for spheres made of the following
materials:

(a) Steel with thermal conductivity k = 34 W/m K

(b) Copper with thermal conductivity k = 340 W/m K

(c) Wood with thermal conductivity k = 0.15 W/m K

Fig. 1.18 Pumping system with two pumps in parallel (total flow
F = 0.01 m³/s; branch pressure drops $\Delta P_1 = (2.1 \times 10^{10})(F_1)^2$
and $\Delta P_2 = (3.6 \times 10^{10})(F_2)^2$)

Pr. 1.6 The thermal network representation of a homogeneous
plane is illustrated in Fig. 1.5. Draw the 3R2C network
representation and derive expressions for the three
resistors and the two capacitors in terms of the two air film 
coefficients and the wall properties (Hint: follow the appro- 
ach illustrated in Fig. 1.5 for the 2R1C network). 



Pr. 1.7 Two pumps in parallel problem viewed from the forward
and the inverse perspectives

Consider Fig. 1.18 which will be analyzed in both the forward
and data driven approaches.

(a) Forward problem: Two pumps with parallel networks
deliver F = 0.01 m³/s of water from a reservoir to the
destination. The pressure drops in Pascals (Pa) of each
network are given by: $\Delta p_1 = (2.1) \cdot 10^{10} \cdot F_1^2$ and
$\Delta p_2 = (3.6) \cdot 10^{10} \cdot F_2^2$ where $F_1$ and $F_2$ are the flow
rates through each branch in m³/s. Assume that pumps
and their motor assemblies have the same efficiency.
Let $P_1$ and $P_2$ be the electric power in Watts (W) consumed
by the two pump-motor assemblies.
(i) Sketch the block diagram for this system with total
electric power as the output variable,
(ii) Frame the total power P as the objective function
which needs to be minimized against total delivered
water F,
(iii) Solve the problem for $F_1$ and $P_1$ and $P$.

(b) Inverse problem: Now consider the same system in the
inverse framework where one would instrument the
existing system such that operational measurements of
P for different $F_1$ and $F_2$ are available.
(i) Frame the function appropriately using insights
into the functional form provided by the forward
model,
(ii) The simplifying assumption of constant efficiency
of the pumps is unrealistic. How would the above
function need to be reformulated if efficiency can
be taken to be a quadratic polynomial (or black-box
model) of flow rate as shown below for the first
piping branch (with a similar expression applying
for the second branch):

$\eta_1 = a_1 + b_1 F_1 + c_1 F_1^2$



Pr. 1.8 Lake contamination problem viewed from the forward
and the inverse perspectives

A lake of volume V is fed by an incoming stream with
volumetric flow rate $Q_s$, and contaminated with concentration
$C_s$ (Fig. 1.19). The outfall of another source (say, the
sewage from a factory) also discharges a flow $Q_w$ of the same
pollutant with concentration $C_w$. The wastes in the stream
and sewage have a decay coefficient k.

Pr. 1.7(a): From Stoecker (1989) by permission of McGraw-Hill.
Pr. 1.8: From Masters and Ela (2008) by permission of Pearson Education.

Fig. 1.19 Perspective of the forward problem for the lake contamination
situation. Incoming stream: $Q_s$ = 5.0 m³/s, $C_s$ = 10.0 mg/L.
Contaminated outfall: $Q_w$ = 0.5 m³/s, $C_w$ = 100.0 mg/L.
Lake: V = 10.0 × 10^6 m³, k = 0.20/day, C = ?.
Outgoing stream: $Q_m$ = ? m³/s, $C_m$ = ? mg/L.

(a) Let us consider the forward model approach. In order to
simplify the problem, the lake will be considered to be a
fully mixed compartment and evaporation and seepage
losses to the lake bottom will be neglected. In such a
case, the concentration of the outflow is equal to that in
the lake, i.e., $C_m = C$. Then, the steady-state concentration
in the lake can be determined quite simply:

Input rate = Output rate + decay rate

where Input rate = $Q_s C_s + Q_w C_w$, Output rate = $Q_m C_m = (Q_s + Q_w) C_m$,
and decay rate = $kCV$. This results in:

$$C = \frac{Q_s C_s + Q_w C_w}{Q_s + Q_w + kV}$$

Verify the above derived expression, and also check that
C = 3.5 mg/L when the numerical values for the various
quantities given in Fig. 1.19 are used.

(b) Now consider the inverse control problem when an ac- 
tual situation can be generally represented by the model 
treated above. One can envision several scenarios; let us 
consider a simple one. Flora and fauna downstream of 
the lake have been found to be adversely affected, and 
an environmental agency would like to investigate this 
situation by installing appropriate instrumentation. The 
agency believes that the factory is polluting the lake, 
which the factory owner, on the other hand, disputes. 
Since it is rather difficult to get a good reading of spa- 
tial averaged concentrations in the lake, the experimen- 
tal procedure involves measuring the cross-sectionally 
averaged concentrations and volumetric flow rates of 
the incoming, outgoing and outfall streams. 

(i) Using the above model, describe the agency's 
thought process whereby they would conclude that 
indeed the factory is the major cause of the pollution. 

(ii) Identify arguments that the factory owner can raise 
to rebut the agency's findings. 

Pr. 1.9 The problem addressed above assumed that only one
source of contaminant outfall was present. Rework the problem
assuming two sources of outfall with different volumetric
flows and concentration levels.
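
The numerical check requested in Pr. 1.8(a) can be done with a few lines of code. The sketch below evaluates the steady-state balance Input rate = Output rate + decay rate for the fully mixed lake, using the values listed in Fig. 1.19 with the lake volume taken as 10.0 × 10^6 m³. The function and argument names are hypothetical; the only subtlety is that the decay coefficient is given per day while the flows are per second, so the units must be made consistent.

```python
SECONDS_PER_DAY = 86400.0

def lake_concentration(q_s, c_s, q_w, c_w, volume, k_per_day):
    """Steady-state concentration (mg/L) of a fully mixed lake.

    q_s, q_w  : stream and outfall flow rates, m3/s
    c_s, c_w  : stream and outfall concentrations, mg/L
    volume    : lake volume, m3
    k_per_day : first-order decay coefficient, 1/day
    """
    k = k_per_day / SECONDS_PER_DAY          # convert decay coefficient to 1/s
    return (q_s * c_s + q_w * c_w) / (q_s + q_w + k * volume)

c = lake_concentration(q_s=5.0, c_s=10.0, q_w=0.5, c_w=100.0,
                       volume=10.0e6, k_per_day=0.20)
print(f"C = {c:.2f} mg/L")   # prints "C = 3.49 mg/L", i.e. about 3.5 mg/L
```

The same function can be reused for the inverse discussion in part (b), for instance by setting the outfall flow to zero to see what concentration the incoming stream alone would produce.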
Probability Concepts and Probability
Distributions



This chapter reviews basic notions of probability (or "stochastic
variability") which is the formal study of the laws
of chance, i.e., where the ambiguity in outcome is inherent 
in the nature of the process itself. Both the primary views 
of probability, namely the frequentist (or classical) and the 
Bayesian, are covered, and some of the important probabi- 
lity distributions are presented. Finally, an effort is made to 
explain how probability is different from statistics, and to 
present different views of probability concepts such as ab- 
solute, relative and subjective probabilities. 



2.1 Introduction 

2.1.1 Outcomes and Simple Events

A random variable is a numerical description of the outcome of an experiment
whose value depends on chance, i.e., whose outcome is not entirely predictable.
Tossing a dice is a random experiment. There are two types of random variables:
(i) a discrete random variable is one that can take on only a finite or
countable number of values,
(ii) a continuous random variable is one that may take on any value in an
interval.

The following basic notions relevant to the study of probability apply
primarily to discrete random variables.

• Outcome is the result of a single trial of a random experiment. It cannot be
decomposed into anything simpler. For example, getting a {2} when a dice is
rolled.

• Sample space (some refer to it as "universe") is the set of all possible
outcomes of a single trial. For the rolling of a dice, the sample space is
S = {1, 2, 3, 4, 5, 6}.

• Event is the combined outcomes (or a collection) of one or more random
experiments defined in a specific manner. For example, getting a pre-selected
number (say, 4) from adding the outcomes of two dice would constitute a simple
event: A = {4}.

• Complement of an event A is the set of outcomes in the sample space not
contained in A. $\bar{A}$ = {2, 3, 5, 6, 7, 8, 9, 10, 11, 12} is the complement
of the event stated above.
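The notions of sample space, event and complement can be made concrete with a few lines of Python. This is only an illustrative sketch for the two-dice sum discussed above; the variable names are chosen here and are not part of the text.

```python
from itertools import product

# Sample space for the SUM of two dice (the random variable of interest here).
faces = range(1, 7)
rolls = list(product(faces, repeat=2))          # 36 equally likely ordered pairs
sample_space = {a + b for a, b in rolls}        # {2, 3, ..., 12}

event_A = {4}                                   # pre-selected sum
complement_A = sample_space - event_A           # everything in S not in A

# Ordered pairs that realize event A: (1, 3), (2, 2), (3, 1)
pairs_for_A = [r for r in rolls if sum(r) in event_A]

print(sorted(sample_space))
print(sorted(complement_A))
print(pairs_for_A)
```

Note that the event A = {4} is "simple" in terms of the sum, even though three different ordered pairs of faces produce it.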



2.1.2 Classical Concept of Probability

Random data by its very nature is indeterminate. So how can a scientific theory
attempt to deal with indeterminacy? Probability theory does just that, and is
based on the fact that though the result of any particular trial of an
experiment cannot be predicted, a long sequence of trials taken together
reveals a stability that can serve as the basis for fairly precise predictions.

Consider the case when an experiment was carried out a number of times and the
anticipated event E occurred in some of them. Relative frequency is the ratio
denoting the fraction of trials in which success has occurred. It is usually
estimated empirically after the event from the following proportion:

$$p(E) = \frac{\text{number of times } E \text{ occurred}}{\text{number of times the experiment was carried out}} \tag{2.1}$$

For certain simpler events, one can determine this proportion without actually
carrying out the experiment; this is referred to as being "wise before the
event". For example, the relative frequency of getting heads (selected as a
"success" event) when tossing a fair coin is 0.5. In any case, this a priori
proportion is interpreted as the long-run relative frequency, and is referred
to as probability. This is the classical, or frequentist, or traditionalist
definition, and has some theoretical basis. This interpretation arises from the
strong law of large numbers (a well-known result in probability theory), which
states that the average of a sequence of independent random variables having
the same distribution will converge to the mean of that distribution. If a dice
is rolled repeatedly, the relative frequency of getting a pre-selected number
between 1 and 6 (say, 4) will vary from one set of rolls to another, but on
average will tend to be close to 1/6.
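The long-run stability described above can be seen in a small simulation. The sketch below assumes a fair six-sided dice simulated with Python's standard pseudo-random generator; the function name and the seed are arbitrary choices made for illustration.

```python
import random

def relative_frequency(target: int, n_rolls: int, rng: random.Random) -> float:
    """Fraction of n_rolls rolls of a fair six-sided dice that show `target`."""
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == target)
    return hits / n_rolls

rng = random.Random(42)          # fixed seed (arbitrary) so the run is repeatable
for n in (10, 100, 10_000, 100_000):
    print(n, round(relative_frequency(4, n, rng), 4))
```

For small n the printed fractions scatter widely; as n grows they settle near 1/6 = 0.1667, which is the behavior the law of large numbers predicts.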



2.1.3 Bayesian Viewpoint of Probability

The classical or traditional probability concepts are associated with the
frequentist view of probability, i.e., interpreting probability as the long-run
frequency. This has a nice intuitive interpretation, hence its appeal. However,
people have argued that most processes are unique events and do not occur
repeatedly, thereby questioning the validity of the frequentist or objective
probability viewpoint. Even when one may have some basic preliminary idea of
the probability associated with a certain event, the frequentist view excludes
such subjective insights in the determination of probability. The Bayesian
approach, however, recognizes such issues by allowing one to update assessments
of probability that integrate prior knowledge with observed events, thereby
allowing better conclusions to be reached. Both the classical and the Bayesian
approaches converge to the same results as increasingly more data (or
information) is gathered. It is when the data sets are small that the Bayesian
approach becomes advantageous. Thus, the Bayesian view is not an approach which
is at odds with the frequentist approach, but rather adds (or allows the
addition of) refinement to it. This can be a great benefit in many types of
analysis, and therein lies its appeal. Bayes' theorem and its application to
discrete and continuous probability variables are discussed in Sect. 2.5, while
Sect. 4.6 (of Chap. 4) presents its application to estimation and hypothesis
problems.



2.2 Classical Probability 

2.2.1 Permutations and Combinations 

The very first concept needed for the study of probability 
is a sound knowledge of combinatorial mathematics which 
is concerned with developing rules for situations involving 
permutations and combinations. 

(a) Permutation P(n, k) is the number of ways that k objects can be selected
from n objects with the order being important. It is given by:

$$P(n,k) = \frac{n!}{(n-k)!} \tag{2.2a}$$

A special case is the number of permutations of n objects taken n at a time:

$$P(n,n) = n! = n(n-1)(n-2)\cdots(2)(1) \tag{2.2b}$$

(b) Combination C(n, k) is the number of ways that k objects can be selected
from n objects with the order not being important. It is given by:

$$C(n,k) = \frac{n!}{(n-k)!\,k!} \equiv \binom{n}{k} \tag{2.3}$$

Note that the same equation also defines the binomial coefficients since the
expansion of $(a+b)^n$ according to the Binomial theorem is

$$(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^{k} \tag{2.4}$$

Example 2.2.1: (a) Calculate the number of ways in which three people from a
group of seven people can be seated in a row.

This is a case of permutation since the order is important. The number of
possible ways is:

$$P(7,3) = \frac{7!}{(7-3)!} = (7)\cdot(6)\cdot(5) = 210$$

(b) Calculate the number of combinations in which three people can be selected
from a group of seven.

Here the order is not important and the combination formula can be used. Thus:

$$C(7,3) = \frac{7!}{(7-3)!\,3!} = \frac{(7)\cdot(6)\cdot(5)}{(3)\cdot(2)} = 35$$

Another type of combinatorial problem is the factorial problem to be discussed
in Chap. 6 while dealing with design of experiments. Consider a specific
example involving equipment scheduling at a physical plant of a large campus
which includes primemovers (diesel engines or turbines which produce
electricity), boilers and chillers (vapor compression and absorption machines).
Such equipment needs a certain amount of time to come online and so operators
typically keep some of them "idling" so that they can start supplying
electricity/heating/cooling at a moment's notice. Their operating states can be
designated by a binary variable; say "1" for on-status and "0" for off-status.
Extensions of this concept include cases where, instead of two states, one
could have m states. An example of 3 states is when, say, two identical boilers
are to be scheduled. One could have three states altogether: (i) when both are
off (0-0), (ii) when both are on (1-1), and (iii) when only one is on (1-0).
Since the boilers are identical, state (iii) is identical to 0-1. If the two
boilers are of different sizes, there would be four possible states. The number
of combinations possible for n such pieces of equipment where each one can
assume m states is given by $m^n$. Some simple cases for scheduling four
different types of energy equipment in a physical plant are shown in Table 2.1.
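The counting rules of this section are easy to check numerically. The following sketch uses Python's standard `math.perm` and `math.comb` (available from Python 3.8) for Eqs. 2.2a and 2.3, and brute-force enumeration as an independent check; the seven "people" and the four on/off units are illustrative placeholders.

```python
import math
from itertools import permutations, combinations, product

people = "ABCDEFG"                       # seven hypothetical people

# Closed-form counts (Eqs. 2.2a and 2.3)
n_perm = math.perm(7, 3)                 # 7!/(7-3)!
n_comb = math.comb(7, 3)                 # 7!/((7-3)! 3!)

# Brute-force enumeration as an independent check
assert n_perm == len(list(permutations(people, 3)))
assert n_comb == len(list(combinations(people, 3)))

# n pieces of equipment, each with m possible states -> m**n schedules
n_equipment, m_states = 4, 2
schedules = list(product(range(m_states), repeat=n_equipment))
assert len(schedules) == m_states ** n_equipment

print(n_perm, n_comb, len(schedules))    # 210 35 16
```

Each permutation count is the corresponding combination count multiplied by 3! = 6, since every unordered group of three people can be seated in six different orders.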



Table 2.1 Number of combinations for equipment scheduling in a large facility
(Status: 0 = off, 1 = on)

                               Primemovers         Boilers        Chillers-Vapor      Chillers-           Number of
                                                                  compression         Absorption          combinations
One of each                    0-1                 0-1            0-1                 0-1                 2^4 = 16
Two of each-assumed identical  0-0, 0-1, 1-1       0-0, 0-1, 1-1  0-0, 0-1, 1-1       0-0, 0-1, 1-1       3^4 = 81
Two of each-non-identical      0-0, 0-1, 1-0, 1-1  0-0, 0-1, 1-1  0-0, 0-1, 1-0, 1-1  0-0, 0-1, 1-0, 1-1  4^3 x 3 = 192
except for boilers

2.2.2 Compound Events and Probability Trees

A compound or joint or composite event is one which arises from operations
involving two or more events. The use of Venn diagrams is a very convenient
manner of illustrating and understanding compound events and their
probabilities (see Fig. 2.1).

Fig. 2.1 Venn diagrams for a few simple cases. a Event A is denoted as a region
in space S. Probability of event A is represented by the ratio of the area
inside the circle to that inside the rectangle. b Events A and B are
intersecting, i.e., have a common overlapping area (shown hatched). c Events A
and B are mutually exclusive or are disjoint events. d Event B is a subset of
event A

• The universe of outcomes or sample space is denoted by a rectangle, while the
probability of a particular event (say, event A) is denoted by a region (see
Fig. 2.1a);

• union of two events A and B (see Fig. 2.1b) is represented by the set of
outcomes in either A or B or both, and is denoted by A ∪ B (where the symbol ∪
is conveniently remembered as "u" of "union"). An example is the number of
cards in a pack which are either hearts or spades (26 nos.);

• intersection of two events A and B is represented by the set of outcomes in
both A and B simultaneously, and is denoted by A ∩ B. It is represented by the
hatched area in Fig. 2.1b. An example is the number of red cards which are
jacks (2 nos.);

• mutually exclusive events or disjoint events are those which have no outcomes
in common (Fig. 2.1c). An example is the number of red cards which are the
seven of spades (nil);

• event B is inclusive in event A when all outcomes of B are contained in those
of A, i.e., B is a sub-set of A (Fig. 2.1d). An example is the set of red cards
less than six (event B), which is contained in the set of red cards (event A).
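The card examples used above map directly onto set operations. The sketch below builds a standard 52-card deck as a Python set and checks the four relations; treating "less than six" as ranks 2 to 5 (ace excluded) is an assumption made here only for illustration.

```python
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "spades", "clubs"]
deck = set(product(ranks, suits))                      # sample space, 52 cards

hearts = {c for c in deck if c[1] == "hearts"}
spades = {c for c in deck if c[1] == "spades"}
red = {c for c in deck if c[1] in ("hearts", "diamonds")}
jacks = {c for c in deck if c[0] == "J"}
seven_of_spades = {("7", "spades")}
# "Less than six" is taken here as ranks 2-5 (ace excluded); an assumption.
red_below_six = {c for c in red if c[0] in ("2", "3", "4", "5")}

print(len(hearts | spades))               # union: 26
print(len(red & jacks))                   # intersection: 2
print(red.isdisjoint(seven_of_spades))    # mutually exclusive: True
print(red_below_six <= red)               # subset: True
```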



2.2.3 Axioms of Probability

Let the sample space S consist of two events A and B with probabilities p(A)
and p(B) respectively. Then:
(i) probability of any event, say A, cannot be negative. This is expressed as:

$$p(A) \ge 0 \tag{2.5}$$

(ii) probabilities of all events must sum to unity (i.e., normalized):

$$p(S) \equiv p(A) + p(B) = 1 \tag{2.6}$$

(iii) probabilities of mutually exclusive events add up:

$$p(A \cup B) = p(A) + p(B) \quad \text{if A and B are mutually exclusive} \tag{2.7}$$

If a dice is rolled, the outcomes are mutually exclusive. If event A is the
occurrence of 2 and event B that of 3, then p(A or B) = 1/6 + 1/6 = 1/3.
Mutually exclusive events and independent events are not to be confused. While
the former is a property of the events themselves, the latter is a property
that arises from the event probabilities and their intersections (this is
elaborated further below).

Some other inferred relations are:
(iv) probability of the complement of event A:

$$p(\bar{A}) = 1 - p(A) \tag{2.8}$$

(v) probability for either A or B (when they are not mutually exclusive) to
occur is equal to:

$$p(A \cup B) = p(A) + p(B) - p(A \cap B) \tag{2.9}$$

This is intuitively obvious from the Venn diagram (see Fig. 2.1b) since the
hatched area (representing $p(A \cap B)$) gets counted twice in the sum and so
needs to be deducted once. This equation can also be deduced from the axioms of
probability. Note that if events A and B are mutually exclusive, then Eq. 2.9
reduces to Eq. 2.7.



2.2.4 Joint, Marginal and Conditional Probabilities

(a) Joint probability of two independent events represents the case when both
events occur together, i.e. p(A and B) = $p(A \cap B)$. It is equal to:

$$p(A \cap B) = p(A) \cdot p(B) \quad \text{if A and B are independent} \tag{2.10}$$

These are called product models. Consider a dice tossing experiment. If event A
is the occurrence of an even number, then p(A) = 1/2. If event B is that the
number is less than or equal to 4, then p(B) = 2/3. The probability that both
events occur when a dice is rolled is p(A and B) = 1/2 × 2/3 = 1/3. This is
consistent with our intuition since the outcomes {2, 4} would satisfy both the
events.

(b) Marginal probability of an event A refers to the probability of A in a
joint probability setting. For example, consider a space containing two events,
A and B. Since S can be taken to be the sum of event space B and its complement
$\bar{B}$, the probability of A can be expressed in terms of the sum of the
disjoint parts of B:

$$p(A) = p(A \cap B) + p(A \cap \bar{B}) \tag{2.11}$$

This notion can be extended to the case of more than two joint events.
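The dice example can be used to check the addition rule (Eq. 2.9), the product rule (Eq. 2.10) and the marginal decomposition (Eq. 2.11) by direct enumeration. The helper `p` below is a hypothetical convenience function that assumes equally likely outcomes; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

S = set(range(1, 7))                      # sample space of one fair dice

def p(event):
    """Probability of an event (a subset of S) with equally likely outcomes."""
    return Fraction(len(event & S), len(S))

A = {x for x in S if x % 2 == 0}          # even number -> {2, 4, 6}
B = {x for x in S if x <= 4}              # at most 4   -> {1, 2, 3, 4}
B_bar = S - B                             # complement of B

# Eq. 2.10: product rule holds, so A and B are independent
assert p(A & B) == p(A) * p(B) == Fraction(1, 3)
# Eq. 2.9: addition rule for events that are not mutually exclusive
assert p(A | B) == p(A) + p(B) - p(A & B)
# Eq. 2.11: marginal probability of A from the two disjoint parts
assert p(A) == p(A & B) + p(A & B_bar)

print(p(A & B), p(A | B), p(A))           # 1/3 5/6 1/2
```

A and B here overlap (they share the outcomes 2 and 4) and yet are independent, which illustrates why independence should not be confused with mutual exclusivity.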

Example 2.2.2: Consider an experiment involving drawing two cards from a deck
with replacement. Let event A = {first card is a red one} and event B = {card
is between 2 and 8 inclusive}. How Eq. 2.11 applies to this situation is easily
shown.
Possible events A: hearts (13 cards) plus diamonds (13 cards)
Possible events B: 4 suits of 2, 3, 4, 5, 6, 7, 8.

$$p(A \cap B) = \frac{1}{2} \cdot \frac{(7)\cdot(4)}{52} = \frac{14}{52} \quad \text{and} \quad p(A \cap \bar{B}) = \frac{1}{2} \cdot \frac{(13-7)\cdot(4)}{52} = \frac{12}{52}$$

Consequently, from Eq. 2.11:

$$p(A) = \frac{14}{52} + \frac{12}{52} = \frac{1}{2}$$

This result of p(A) = 1/2 is obvious in this simple experiment, and could have
been deduced intuitively. However, intuition may mislead in more complex cases,
and hence the usefulness of this approach. ■
(c) Conditional probability: There are several situations involving compound
outcomes that are sequential or successive in nature. The chance result of the
first stage determines the conditions under which the next stage occurs. Such
events, called two-stage (or multi-stage) events, involve step-by-step outcomes
which can be represented as a probability tree. This allows better
visualization of how the probabilities progress from one stage to the next. If
A and B are events, then the probability that event B occurs given that A has
already occurred is given by:

$$p(B/A) = \frac{p(A \cap B)}{p(A)} \tag{2.12}$$

A special but important case is when p(B/A) = p(B). In this case, B is said to
be independent of A because the fact that event A has occurred does not affect
the probability of B occurring. Thus, two events A and B are independent if
p(B/A) = p(B). In this case, one gets back Eq. 2.10.

An example of a conditional probability event is the drawing of a spade from a
pack of cards from which a first card was already drawn. If it is known that
the first card was not a spade, then all 13 spades remain among the 51 cards
left and the probability of drawing a spade the second time is 13/51. On the
other hand, if the first card drawn was a spade, then the probability of
getting a spade on the second draw is 12/51 = 4/17.

Example 2.2.3: A single fair dice is rolled. Let event A = {even outcome} and
event B = {outcome is divisible by 3}.

(a) List the various events in the sample space: {1 2 3 4 5 6}

(b) List the outcomes in A and find p(A): {2 4 6}, p(A) = 1/2

(c) List the outcomes of B and find p(B): {3 6}, p(B) = 1/3

(d) List the outcomes in A ∩ B and find p(A ∩ B): {6}, p(A ∩ B) = 1/6

(e) Are the events A and B independent? Yes, since Eq. 2.10 holds:
p(A) · p(B) = 1/2 × 1/3 = 1/6 = p(A ∩ B) ■

Example 2.2.4: Two defective bulbs have been mixed with eight good ones. Let
event A = {first bulb is good}, and event B = {second bulb is good}.
(a) If two bulbs are chosen at random with replacement, what is the probability
that both are good?

p(A) = 8/10 and p(B) = 8/10. Then:

$$p(A \cap B) = \frac{8}{10} \cdot \frac{8}{10} = \frac{64}{100} = 0.64$$

(b) What is the probability that two bulbs drawn in sequence (i.e., not
replaced) are good where the status of the bulb can be checked after the first
draw?
From Eq. 2.12, p(both bulbs drawn are good):

$$p(A \cap B) = p(A) \cdot p(B/A) = \frac{8}{10} \cdot \frac{7}{9} = \frac{28}{45} = 0.622$$
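The two sampling schemes of this example can be verified by enumerating every equally likely ordered pair of draws. The sketch assumes ten bulbs in total of which eight are good, which is what p(A) = 8/10 in the solution implies; the function name is illustrative.

```python
from fractions import Fraction
from itertools import product, permutations

# Ten bulbs in total: 8 good ("G") and 2 defective ("D"), as in p(A) = 8/10.
bulbs = ["G"] * 8 + ["D"] * 2
idx = range(len(bulbs))

def p_both_good(draws):
    """Fraction of equally likely ordered draws in which both bulbs are good."""
    draws = list(draws)
    good = sum(1 for i, j in draws if bulbs[i] == "G" and bulbs[j] == "G")
    return Fraction(good, len(draws))

with_replacement = p_both_good(product(idx, repeat=2))    # same bulb may repeat
without_replacement = p_both_good(permutations(idx, 2))   # two distinct bulbs

print(with_replacement, float(with_replacement))                   # 16/25 0.64
print(without_replacement, round(float(without_replacement), 3))   # 28/45 0.622
```

With replacement the two draws are independent and Eq. 2.10 applies; without replacement the second draw is conditional on the first, so Eq. 2.12 is needed.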



Example 2.2.5: Two events A and B have the following probabilities:
$p(A) = 0.3$, $p(B) = 0.4$ and $p(\bar{A} \cap B) = 0.28$.

(a) Determine whether the events A and B are independent or not.

From Eq. 2.8, $p(\bar{A}) = 1 - p(A) = 0.7$. Next, one will verify whether
Eq. 2.10 holds or not. In this case, one needs to verify whether
$p(\bar{A} \cap B) = p(\bar{A}) \cdot p(B)$, or whether 0.28 is equal to
(0.7 × 0.4). Since this is correct, one can state that events A and B are
independent.

(b) Find $p(A \cup B)$

From Eqs. 2.9 and 2.10:

$$p(A \cup B) = p(A) + p(B) - p(A \cap B) = p(A) + p(B) - p(A) \cdot p(B) = 0.3 + 0.4 - (0.3)(0.4) = 0.58$$ ■



Fig. 2.2 The forward probability tree for the residential air-conditioner when
two outcomes are possible (S satisfactory or NS not satisfactory) for each of
three day-types (VH very hot, H hot and NH not hot)



Example 2.2.6: Generating a probability tree for a residential air-conditioning
(AC) system.

Assume that the AC is slightly under-sized for the house it serves. There are
two possible outcomes (S- satisfactory and NS- not satisfactory) depending on
whether the AC is able to maintain the desired indoor temperature. The outcomes
depend on the outdoor temperature, and for simplicity, its annual variability
is grouped into three categories: very hot (VH), hot (H) and not hot (NH). The
probabilities for outcomes S and NS to occur in each of the three day-type
categories are shown in the probability tree diagram (Fig. 2.2) while the joint
probabilities computed following Eq. 2.10 are assembled in Table 2.2.

Note that the probabilities of the three branches in the first stage, as well
as those of the two branches leaving each day-type in the second stage, add to
unity (for example, in the Very Hot, the S and NS outcomes add to 1.0, and so
on). Further, note that the joint probabilities shown in the table also have to
sum to unity (it is advisable to perform such verification checks). The
probability of the indoor conditions being satisfactory is determined as:
p(S) = 0.02 + 0.27 + 0.6 = 0.89 while p(NS) = 0.08 + 0.03 + 0 = 0.11. It is
wise to verify that p(S) + p(NS) = 1.0. ■
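The bookkeeping of the forward tree can be sketched as follows. The branch probabilities are those listed in Table 2.2; the dictionary layout and variable names are illustrative choices, not part of the text.

```python
# First-stage probabilities of the day types and second-stage conditional
# probabilities of a satisfactory (S) outcome, read off Table 2.2.
p_day = {"VH": 0.1, "H": 0.3, "NH": 0.6}
p_S_given_day = {"VH": 0.2, "H": 0.9, "NH": 1.0}

joint = {}
for day, p_d in p_day.items():
    joint[(day, "S")] = p_d * p_S_given_day[day]
    joint[(day, "NS")] = p_d * (1.0 - p_S_given_day[day])

p_S = sum(v for (day, outcome), v in joint.items() if outcome == "S")
p_NS = sum(v for (day, outcome), v in joint.items() if outcome == "NS")

print(round(p_S, 2), round(p_NS, 2))            # 0.89 0.11
assert abs(sum(joint.values()) - 1.0) < 1e-12   # joint probabilities sum to one
```

Each joint probability is the product of the probabilities along one path of the tree, and summing over the paths that end in S gives the total probability of a satisfactory outcome.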

Table 2.2 Joint probabilities of various outcomes
p(VH ∩ S) = 0.1 × 0.2 = 0.02
p(VH ∩ NS) = 0.1 × 0.8 = 0.08
p(H ∩ S) = 0.3 × 0.9 = 0.27
p(H ∩ NS) = 0.3 × 0.1 = 0.03
p(NH ∩ S) = 0.6 × 1.0 = 0.6
p(NH ∩ NS) = 0.6 × 0 = 0

Example 2.2.7: Consider a problem where there are two boxes with marbles as
specified:
Box 1: 1 red and 1 white, and Box 2: 3 red and 1 green

A box is chosen at random and a marble drawn from it. What is the probability
of getting a red marble?

One is tempted to say that since there are 4 red marbles in total out of 6
marbles, the probability is 2/3. However, this is incorrect, and the proper
analysis approach requires that one frame this problem as a two-stage
experiment. The first stage is the selection of the box, and the second the
drawing of the marble. Let event A (or event B) denote choosing Box 1 (or
Box 2). Let R, W and G represent red, white and green marbles. The resulting
probabilities are shown in Table 2.3. Thus, the probability of getting a red
marble = 1/4 + 3/8 = 5/8. The above example is depicted in Fig. 2.3 where the
reader can visually note how the probabilities propagate through the
probability tree. This is called the "forward tree" to differentiate it from
the "reverse" tree discussed in Sect. 2.5.

Table 2.3 Probabilities of various outcomes
p(A ∩ R) = 1/2 × 1/2 = 1/4        p(B ∩ R) = 1/2 × 3/4 = 3/8
p(A ∩ W) = 1/2 × 1/2 = 1/4        p(B ∩ W) = 1/2 × 0 = 0
p(A ∩ G) = 1/2 × 0 = 0            p(B ∩ G) = 1/2 × 1/4 = 1/8

Fig. 2.3 The first stage of the forward probability tree diagram involves
selecting a box (either A or B) while the second stage involves drawing a
marble which can be red (R), white (W) or green (G) in color. The total
probability of drawing a red marble is 5/8

The above example illustrates how a two-stage experiment has to be approached.
First, one selects a box which by itself does not tell us whether the marble is
red (since one has yet to pick a marble). Only after a box is selected can one
use the prior probabilities regarding the color of the marbles inside the box
in question to determine the probability of picking a red marble. These prior
probabilities can be viewed as conditional probabilities; i.e., for example,
$p(A \cap R) = p(R/A) \cdot p(A)$ ■
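The two-stage reasoning can be written out as a short calculation. The sketch below assumes the box contents implied by Table 2.3 (Box 1 with one red and one white marble, Box 2 with three red and one green) and an equal chance of picking either box; `p_color` is a hypothetical helper.

```python
from fractions import Fraction

# Stage 1: a box is picked at random. Stage 2: a marble is drawn from it.
# Box contents follow Table 2.3 (Box 2 holds 3 red and 1 green marble).
boxes = {
    "A": {"R": 1, "W": 1},
    "B": {"R": 3, "G": 1},
}
p_box = Fraction(1, len(boxes))

def p_color(color):
    """Total probability of a color: sum over boxes of p(box) * p(color/box)."""
    total = Fraction(0)
    for contents in boxes.values():
        n = sum(contents.values())
        total += p_box * Fraction(contents.get(color, 0), n)
    return total

print(p_color("R"))                # 5/8
print(Fraction(4, 6))              # naive pooled answer: 2/3
assert sum(p_color(c) for c in "RWG") == 1
```

The pooled answer of 2/3 would only be right if all six marbles were equally likely to be drawn, which is not the case because the boxes hold different numbers of marbles.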



2.3 Probability Distribution Functions 

2.3.1 Density Functions 

The notions of discrete and continuous random variables were introduced in
Sect. 2.1.1. The distribution of a random variable represents the probability
of it taking its various possible values. For example, if the y-axis in
Fig. 1.1 of the dice rolling experiment were to be changed into a relative
frequency (= 1/6), the resulting histogram would graphically represent the
corresponding probability density function (PDF) (Fig. 2.4a). Thus, the
probability of getting a 2 in the rolling of a dice is 1/6. Since this is a
discrete random variable, the function takes on specific values at discrete
points of the x-axis (which represents the outcomes). The same type of y-axis
normalization done to the data shown in Fig. 1.2 would result in the PDF for
the case of continuous random data. This is shown in Fig. 2.5a for the random
variable taken to be the hourly outdoor dry bulb temperature over the year at
Philadelphia, PA. Notice that this is the envelope of the histogram of
Fig. 1.2. Since the variable is continuous, it is not meaningful to try to
determine the probability of, say, a temperature outcome of exactly 57.5°F. One
would be interested in the probability of outcomes within a range, say
55°-60°F. The probability can then be determined as the area under the PDF as
shown in Fig. 2.5b. It is for such continuous random variables that the
cumulative distribution function (CDF) is useful. It is simply the cumulative
area under the curve starting from the lowest value of the random variable to
the current value (Fig. 2.6). The vertical scale directly gives the probability
(or, in this case, the fractional time) that X is less than a certain value;
the probability that X is greater than that value follows as its complement.
Thus, the probability (x < 60) is about 0.58. The concept of CDF also applies
to discrete variables as illustrated in Fig. 2.4b for the dice rolling example.

Fig. 2.4 Probability functions for a discrete random variable involving the
outcome of rolling a dice. a Probability density function. b Cumulative
distribution function

Fig. 2.5 Probability density function and its association with probability for
a continuous random variable involving the outcomes of hourly outdoor
temperatures at Philadelphia, PA during a year. The probability that the
temperature will be between 55° and 60°F is given by the shaded area. a Density
function. b Probability interpreted as an area

Fig. 2.6 The cumulative distribution function (CDF) for the PDF shown in
Fig. 2.5. Such a plot allows one to easily determine the probability that the
temperature is less than 60°F

To restate, depending on whether the random variable is discrete or continuous,
one gets discrete or continuous probability distributions. Though most
experimentally gathered data is discrete, the underlying probability theory is
based on the data being continuous. Replacing the integration sign by the
summation sign in the equations that follow allows extending the following
definitions to discrete distributions. Let f(x) be the probability distribution
function associated with a random variable X. This is a function which provides
the probability that a discrete random variable X takes on some specific value
x among its various possible values. The axioms of probability (Eqs. 2.5 and
2.6) for the discrete case are expressed for the case of continuous random
variables as:

• PDF cannot be negative:

$$f(x) \ge 0, \quad -\infty < x < \infty \tag{2.13}$$

• Probability of the sum of all outcomes must be unity:

$$\int_{-\infty}^{\infty} f(x)\,dx = 1 \tag{2.14}$$

The cumulative distribution function (CDF) or F(a) represents the area under
f(x) enclosed in the range $-\infty < x < a$:

$$F(a) = p\{X \le a\} = \int_{-\infty}^{a} f(x)\,dx \tag{2.15}$$

The inverse relationship between f(x) and F(a), provided a derivative exists,
is:

$$f(x) = \frac{dF(x)}{dx} \tag{2.16}$$

This leads to the probability of an outcome $a \le X \le b$ given by:

$$p\{a \le X \le b\} = \int_{a}^{b} f(x)\,dx = \int_{-\infty}^{b} f(x)\,dx - \int_{-\infty}^{a} f(x)\,dx = F(b) - F(a) \tag{2.17}$$

Notice that the CDF for discrete variables will be a step function (as in
Fig. 2.4b) since the PDF is defined at discrete values only. Also, the CDF for
continuous variables is a function which increases monotonically with
increasing x. For example, the probability of the outdoor temperature being
between 55° and 60°F is given by
$p\{55 \le X \le 60\} = F(b) - F(a) = 0.58 - 0.50 = 0.08$ (see Fig. 2.6).
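For the discrete dice example, the step-shaped CDF and the discrete counterparts of Eqs. 2.13, 2.14 and 2.17 can be sketched in a few lines. Exact fractions are used, and the variable names are illustrative.

```python
from fractions import Fraction
from itertools import accumulate

outcomes = [1, 2, 3, 4, 5, 6]
pmf = {x: Fraction(1, 6) for x in outcomes}            # f(x) of Fig. 2.4a

# Running sum of f(x) gives the step-shaped CDF F(x) of Fig. 2.4b
cdf = dict(zip(outcomes, accumulate(pmf[x] for x in outcomes)))

assert all(p >= 0 for p in pmf.values())               # discrete Eq. 2.13
assert sum(pmf.values()) == 1                          # discrete Eq. 2.14

# Discrete counterpart of Eq. 2.17: p(2 < X <= 5) = F(5) - F(2)
p_3_to_5 = cdf[5] - cdf[2]
print([str(cdf[x]) for x in outcomes])
print(p_3_to_5)                                        # 1/2
```

For a discrete variable the end points matter: F(5) - F(2) covers the outcomes 3, 4 and 5 but not 2, whereas for a continuous variable a single point carries no probability.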

The concept of probability distribution functions can be extended to the
treatment of simultaneous outcomes of multiple random variables. For example,
one would like to study how the temperature of quenching of a particular item
made of steel affects its hardness. Let X and Y be the two random variables.
The probability that they occur together can be represented by a function
f(x, y) for any pair of values (x, y) within the range of variability of the
random variables X and Y. This function is referred to as the joint probability
density function of X and Y which has to satisfy the following properties for
continuous variables:

$$f(x,y) \ge 0 \quad \text{for all } (x,y) \tag{2.18}$$

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,dx\,dy = 1 \tag{2.19}$$

$$p[(X,Y) \in A] = \iint_{A} f(x,y)\,dx\,dy \tag{2.20}$$

where A is any region in the xy plane.

If X and Y are two independent random variables, their joint PDF will be the
product of their marginal ones:

$$f(x,y) = f(x) \cdot f(y) \tag{2.21}$$

Note that this is the continuous variable counterpart of Eq. 2.10 which gives
the joint probability of two discrete events.

The marginal distribution of X given two jointly distributed random variables X
and Y is simply the probability distribution of X ignoring that of Y. This is
determined for X as:

$$g(x) = \int_{-\infty}^{\infty} f(x,y)\,dy \tag{2.22}$$

Finally, the conditional probability distribution of Y given that X = x for two
jointly distributed random variables X and Y is:

$$f(y/x) = \frac{f(x,y)}{g(x)}, \quad g(x) > 0 \tag{2.23}$$

Example 2.3.1: Determine the value of the constant so that each of the
following functions can serve as a probability distribution of the random
variable X:

(a) $f(x) = c(x^2 + 4)$ for x = 0, 1, 2, 3 (discrete)

(b) $f(x) = ax^2$ for $-1 < x < 2$ (continuous)

(a) One uses the discrete version of Eq. 2.14, i.e., $\sum_{i=0}^{3} f(x_i) = 1$,
which leads to 4c + 5c + 8c + 13c = 1, from which c = 1/30.

(b) One uses Eq. 2.14 modified for the limiting range in x:
$\int_{-1}^{2} ax^2\,dx = 1$, from which $\left[\frac{ax^3}{3}\right]_{-1}^{2} = 1$,
resulting in a = 1/3. ■

Example 2.3.2: The operating life in weeks of a high efficiency air filter in
an industrial plant is a random variable X having the PDF:

$$f(x) = \frac{20{,}000}{(x+100)^3} \quad \text{for } x > 0$$

Find the probability that the filter will have an operating life of:

(a) at least 20 weeks

(b) anywhere between 80 and 120 weeks

First, determine the expression for the CDF from Eq. 2.15. Since the operating
life would decrease with time, one needs to be careful about the limits of
integration applicable to this case. Thus,

$$\text{CDF} = \int_{0}^{x} \frac{20{,}000}{(u+100)^3}\,du = \left[-\frac{10{,}000}{(u+100)^2}\right]_{0}^{x}$$

(a) with x = 20, the probability that the life is at least 20 weeks:

$$p(20 < X < \infty) = \left[-\frac{10{,}000}{(x+100)^2}\right]_{20}^{\infty} = 0.694$$

(b) for this case, the limits of integration are simply modified as follows:

$$p(80 < X < 120) = \left[-\frac{10{,}000}{(x+100)^2}\right]_{80}^{120} = 0.102$$ ■
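The filter-life calculation can be cross-checked numerically. The sketch below assumes the density f(x) = 20,000/(x + 100)^3 for x > 0, which is the scaling for which the density integrates to one over x > 0; the function names and the simple midpoint integrator are illustrative choices.

```python
def pdf(x: float) -> float:
    """Assumed filter-life density, f(x) = 20000 / (x + 100)**3 for x > 0."""
    return 20_000.0 / (x + 100.0) ** 3

def cdf(x: float) -> float:
    """Closed form of the integral of pdf from 0 to x: 1 - 10000/(x + 100)**2."""
    return 1.0 - 10_000.0 / (x + 100.0) ** 2

def integrate(f, a: float, b: float, n: int = 100_000) -> float:
    """Midpoint-rule estimate of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

p_at_least_20 = 1.0 - cdf(20.0)                 # survival beyond 20 weeks
p_80_to_120 = cdf(120.0) - cdf(80.0)            # Eq. 2.17: F(b) - F(a)

print(round(p_at_least_20, 3))                  # 0.694
print(round(p_80_to_120, 3))                    # 0.102
# The numerical integral should agree with the closed form
assert abs(integrate(pdf, 80.0, 120.0) - p_80_to_120) < 1e-6
```

Since cdf(x) tends to 1 as x grows, the assumed density satisfies the normalization condition of Eq. 2.14.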



Example 2.3.3: Consider two random variables X and Y with the following joint
density function:

$$f(x,y) = \frac{2}{5}(2x + 3y) \quad \text{for } 0 \le x \le 1,\; 0 \le y \le 1$$

(a) Verify whether the normalization criterion is satisfied. This is easily
verified from Eq. 2.19:

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,dx\,dy = \int_{0}^{1}\int_{0}^{1} \frac{2}{5}(2x+3y)\,dx\,dy = \int_{0}^{1} \left[\frac{2x^2}{5} + \frac{6xy}{5}\right]_{x=0}^{x=1} dy = \int_{0}^{1} \left(\frac{2}{5} + \frac{6y}{5}\right) dy = \frac{2}{5} + \frac{3}{5} = 1$$

(b) Determine the joint probability in the region (0 < x < 1/2,
1/4 < y < 1/2). In this case, one uses Eq. 2.20 as follows:

$$p(0 < X < 1/2,\; 1/4 < Y < 1/2) = \int_{1/4}^{1/2}\int_{0}^{1/2} \frac{2}{5}(2x+3y)\,dx\,dy = \frac{13}{160}$$

(c) Determine the marginal distribution g(x). From Eq. 2.22:

$$g(x) = \int_{0}^{1} \frac{2}{5}(2x+3y)\,dy = \left[\frac{4xy}{5} + \frac{6y^2}{10}\right]_{y=0}^{y=1} = \frac{4x+3}{5}$$

Table 2.4 Computing marginal probabilities from a probability table

Age (Y)                      Income (X)                                 Marginal
                             < $40,000   $40,000-90,000   > $90,000     probability of Y
Under 25                     0.15        0.09             0.05          0.29
Between 25-40                0.10        0.16             0.12          0.38
Above 40                     0.08        0.20             0.05          0.33
Marginal probability of X    0.33        0.45             0.22          Should sum to 1.00 both ways

Example 2.3.4: The percentage data of annual income versus age has been
gathered from a large population living in a certain region (see Table 2.4).
Let X be the income and Y the age. The marginal probability of X for each class
is simply the sum of the probabilities under each column and that of Y the sum
of those for each row. Thus, p(X < 40,000) = 0.15 + 0.10 + 0.08 = 0.33, and so
on. Also, verify that the marginal probabilities of X and those of Y each sum
to 1.00 (so as to satisfy the normalization condition).
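Both the continuous joint density and the discrete probability table can be checked with a short script. The sketch assumes the joint density f(x, y) = (2/5)(2x + 3y) on the unit square and the cell values of Table 2.4; the midpoint-rule integrator is a simple illustrative choice (it is exact, up to rounding, for a linear integrand).

```python
def f(x: float, y: float) -> float:
    """Joint density of Example 2.3.3 on the unit square."""
    return 0.4 * (2.0 * x + 3.0 * y)

def integrate2d(g, x0, x1, y0, y1, n: int = 200) -> float:
    """Midpoint rule on an n-by-n grid over the rectangle [x0,x1] x [y0,y1]."""
    hx, hy = (x1 - x0) / n, (y1 - y0) / n
    return hx * hy * sum(
        g(x0 + (i + 0.5) * hx, y0 + (j + 0.5) * hy)
        for i in range(n) for j in range(n)
    )

total = integrate2d(f, 0, 1, 0, 1)                 # Eq. 2.19 -> 1
p_region = integrate2d(f, 0, 0.5, 0.25, 0.5)       # Eq. 2.20 -> 13/160
print(round(total, 6), round(p_region, 6))         # 1.0 0.08125

# Table 2.4: rows are age classes (Y), columns are income classes (X)
table = [
    [0.15, 0.09, 0.05],   # under 25
    [0.10, 0.16, 0.12],   # between 25 and 40
    [0.08, 0.20, 0.05],   # above 40
]
marginal_Y = [round(sum(row), 2) for row in table]
marginal_X = [round(sum(col), 2) for col in zip(*table)]
print(marginal_X, marginal_Y)    # [0.33, 0.45, 0.22] [0.29, 0.38, 0.33]
```

Summing the table over rows or columns is the discrete analogue of integrating the joint density over one variable, as in Eq. 2.22.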



2.3.2 Expectation and Moments 

This section deals with ways by which one can summarize 
the characteristics of a probability function using a few im- 
portant measures. Commonly, the mean or the expected value 
E[X] is used as a measure of the central tendency of the dis- 
tribution, and the variance var[X] as a measure of dispersion 
of the distribution about its mean. These are very similar to 
the notions of arithmetic mean and variance of a set of data. 
As before, the equations which apply to continuous random 
variables are shown below; in case of discrete variables, the 
integrals have to be replaced with summations. 
• expected value of the first moment or mean 



E[X] ^ M 



-I 



xf(x)dx. 



(2.24) 



var[Z] = E[X^] - /Lt^ 



(2.25b) 



Notice the appearance of the expected value of the second 
moment E[X^] in the above equation. The variance is ana- 
logous to the physical concept of the moment of inertia of a 
mass distribution about its center of gravity. 

In order to express the variance which is a measure of 
dispersion in the same units as the random variable itself, the 
square root of the variance, namely the standard deviation 
a is used. Finally, errors have to be viewed, or evaluated, 
in terms of the magnitude of the random variable. Thus, the 
relative error is often of more importance than the actual 
error. This has led to the widespread use of a dimensionless 
quantity called the Coefficient of Variation (CV) defined as 
the percentage ratio of the standard deviation to the mean: 



CV = 100-(-) (2.26) 



2.3.3 Function of Random Variables 

The above definitions can be extended to the case when the random variable X is a function of several random variables; for example:

$$X = a_0 + a_1 X_1 + a_2 X_2 \qquad (2.27)$$

where the $a_i$ coefficients are constants and the $X_i$ are random variables.

Some important relations regarding the mean:

$$E[a_0] = a_0$$
$$E[a_1 X_1] = a_1 E[X_1] \qquad (2.28)$$
$$E[a_0 + a_1 X_1 + a_2 X_2] = a_0 + a_1 E[X_1] + a_2 E[X_2]$$

Similarly, there are a few important relations that apply to the variance:

$$\mathrm{var}[a_0] = 0$$
$$\mathrm{var}[a_1 X_1] = a_1^2\,\mathrm{var}[X_1] \qquad (2.29)$$



The mean is exactly analogous to the physical concept of center of gravity of a mass distribution. This is the reason why PDFs are also referred to as "mass distribution functions". The concept of symmetry of a PDF is an important one, implying that the distribution is symmetric about the mean. A distribution is symmetric if $p(\mu - x) = p(\mu + x)$ for every value of x.



$$\mathrm{var}[X] = \sigma^2 = E[(X-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx \qquad (2.25a)$$

Alternatively, it can be shown that for any discrete distribution the variance is given by Eq. 2.25b.



Again, if the two random variables are independent,

$$\mathrm{var}[a_0 + a_1 X_1 + a_2 X_2] = a_1^2\,\mathrm{var}[X_1] + a_2^2\,\mathrm{var}[X_2] \qquad (2.30)$$



Fig. 2.7 Skewed and symmetric distributions. a Skewed to the right, b Symmetric, c Skewed to the left

The notion of covariance of two random variables is an important one since it is a measure of the tendency of two random variables to vary together. The covariance is defined as:

$$\mathrm{cov}[X_1, X_2] = E[(X_1 - \mu_1) \cdot (X_2 - \mu_2)] \qquad (2.31)$$

where $\mu_1$ and $\mu_2$ are the mean values of the random variables $X_1$ and $X_2$ respectively. Thus, for the case of two random variables which are not independent, Eq. 2.30 needs to be modified into:

$$\mathrm{var}[a_0 + a_1 X_1 + a_2 X_2] = a_1^2\,\mathrm{var}[X_1] + a_2^2\,\mathrm{var}[X_2] + 2 a_1 a_2 \cdot \mathrm{cov}[X_1, X_2] \qquad (2.32)$$
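The variance relation for two dependent random variables can be checked numerically. The sketch below is only an illustration: the joint probabilities and the constants a0, a1, a2 are hypothetical values chosen for the check, not data from the text. It computes the variance of a0 + a1·X1 + a2·X2 directly from its definition and compares it with the right-hand side of Eq. 2.32.

```python
# Hypothetical joint PMF of two dependent discrete variables (X1, X2).
# The numbers are illustrative assumptions only; they sum to 1.
joint_pmf = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.15, (1, 1): 0.45,
}


def expect(func):
    """Expected value of func(x1, x2) under the joint PMF."""
    return sum(p * func(x1, x2) for (x1, x2), p in joint_pmf.items())


a0, a1, a2 = 2.0, 3.0, -1.5  # arbitrary constants for the linear combination

mu1 = expect(lambda x1, x2: x1)
mu2 = expect(lambda x1, x2: x2)
var1 = expect(lambda x1, x2: (x1 - mu1) ** 2)
var2 = expect(lambda x1, x2: (x2 - mu2) ** 2)
cov12 = expect(lambda x1, x2: (x1 - mu1) * (x2 - mu2))  # Eq. 2.31

# Variance of Y = a0 + a1*X1 + a2*X2 computed directly from its definition.
mu_y = expect(lambda x1, x2: a0 + a1 * x1 + a2 * x2)
var_y_direct = expect(lambda x1, x2: (a0 + a1 * x1 + a2 * x2 - mu_y) ** 2)

# Right-hand side of Eq. 2.32.
var_y_formula = a1**2 * var1 + a2**2 * var2 + 2 * a1 * a2 * cov12

print(var_y_direct, var_y_formula)
```

Setting the covariance term to zero recovers Eq. 2.30, which is why that simpler form applies only to independent variables.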



Moments higher than the second are sometimes used. For 
example, the third moment yields the skewness which is a 
measure of the symmetry of the PDF. Figure 2.7 shows three 
distributions: one skewed to the right, a symmetric distribu- 
tion, and one skewed to the left. The fourth moment yields 
the coefficient of kurtosis which is a measure of the peaki- 
ness of the PDF. 

Two commonly encountered terms are the median and 
the mode. The value of the random variable at which the 
PDF has a peak is the mode, while the median divides the 
PDF into two equal parts (each part representing a probabi- 
lity of 0.5). 

Finally, distributions can also be described by the number 
of "humps" they display. Figure 2.8 depicts the case of uni- 
modal and bi-modal distributions, while Fig. 2.5 is the case 
of a distribution with three humps. 

Example 2.3.5: Let X be a random variable representing 
the number of students who fail a class. Its PDF is given in 
Table 2.5. 

The discrete event form of Eqs. 2.24 and 2.25 is used to compute the mean and the variance:

$$\mu = (0)(0.51) + (1)(0.38) + (2)(0.10) + (3)(0.01) = 0.61$$

Further:

$$E(X^2) = (0)(0.51) + (1^2)(0.38) + (2^2)(0.10) + (3^2)(0.01) = 0.87$$

Hence:

$$\sigma^2 = 0.87 - (0.61)^2 = 0.4979$$ ■



Table 2.5 PDF of number of students failing a class

X       0      1      2      3
f(x)    0.51   0.38   0.10   0.01
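The discrete forms of the mean and variance used in Example 2.3.5 are short enough to sketch in code. The following is a minimal illustration using the probabilities from the example; the variable names are arbitrary, and the Coefficient of Variation line is added only to show Eq. 2.26 in use.

```python
# PDF of the number of failing students (values as used in Example 2.3.5).
pmf = {0: 0.51, 1: 0.38, 2: 0.10, 3: 0.01}

mean = sum(x * p for x, p in pmf.items())              # discrete form of Eq. 2.24
second_moment = sum(x**2 * p for x, p in pmf.items())  # E[X^2]
variance = second_moment - mean**2                     # Eq. 2.25b
std_dev = variance ** 0.5
cv_percent = 100 * std_dev / mean                      # Eq. 2.26

print(f"mean={mean:.2f}, E[X^2]={second_moment:.2f}, var={variance:.4f}")
```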



Example 2.3.6: Consider Example 2.3.2 where a PDF of X is defined. Let g(x) be a function of this random variable such that

$$g(x) = 4x + 3$$

One wishes to determine the expected value of g(X). From Eq. 2.24,

$$E[X] = \int_0^{\infty} \frac{20x}{(x+100)^3}\,dx = 0.1$$

Then from Eq. 2.28

$$E[g(X)] = 3 + 4 \cdot E[X] = 3.4$$

Fig. 2.8 Unimodal and bi-modal distributions, a Unimodal, b Bi-modal

2.4 Important Probability Distributions

2.4.1 Background 

Data arising from an occurrence or phenomenon or descrip- 
tive of a class or a group can be viewed as a distribution of 
a random variable with a PDF associated with it. A majority 
of data sets encountered in practice can be described by one 
(or two) among a relatively few PDFs. The ability to charac- 
terize data in this manner provides distinct advantages to the 
analysts in terms of: understanding the basic dynamics of the 
phenomenon, in prediction and confidence interval specifi- 
cation, in classification, and in hypothesis testing (discussed 
in Chap. 4). Such insights eventually allow better decision 
making or sounder structural model identification since they 
provide a means of quantifying the random uncertainties 
inherent in the data. Surprisingly, most of the commonly 
encountered or important distributions have a common ge- 
nealogy, shown in Fig. 2.9 which is a useful mnemonic for 
the reader. 



2.4.2 Distributions for Discrete Variables 

(a) Bernoulli Process. Consider an experiment involving repeated trials where only two complementary outcomes are possible which can be labeled either as a "success" or a "failure". Such a process is called a Bernoulli process: (i) if the successive trials are independent, and (ii) if the probability of success p remains constant from one trial to the next. Note that the number of partitions or combinations of n outcomes into two groups with x in one group and (n-x) in the other is equal to

$$C(n,x) = \binom{n}{x} = \frac{n!}{x!\,(n-x)!}$$



(b) Binomial Distribution. The number of successes in n Bernoulli trials is called a binomial random variable. Its PDF is called a Binomial distribution (so named because of its association with the terms of the binomial expansion). It is a unimodal distribution which gives the probability of x successes in n independent trials, if the probability of success in any one trial is p. Note that the outcomes must be Bernoulli trials. This distribution is given by:



$$B(x; n, p) = \binom{n}{x} p^x (1-p)^{n-x} \qquad (2.33a)$$

with mean $\mu = n \cdot p$ and variance

$$\sigma^2 = n\,p\,(1-p) \qquad (2.33b)$$

Fig. 2.9 Genealogy between different important probability distribution functions. Those that are discrete functions are represented by "D" while the rest are continuous functions. (Adapted with modification from R. E. Lave, Jr. of Stanford University)
[Diagram labels recovered from the figure: Hypergeometric (n trials without replacement); Bernoulli trials (two outcomes, success probability p; n trials with replacement); number of trials before success; Exponential E(λ); Gamma G(α, λ); F-distribution F(m, n)]

Fig. 2.10 Plots for the Binomial distribution illustrating the effect of probability of success p with X being the probability of the number of "successes" in a total number of n trials. Note how the skewness in the PDF is affected by p (frames a and b), and how the number of trials affects the shape of the PDF (frames a and c). Instead of vertical bars at discrete values of X as is often done for discrete distributions such as the Binomial, the distributions are shown as contour points so as to be consistent with how continuous distributions are represented. a n=15 and p=0.1, b n=15 and p=0.9, c n=100 and p=0.1, d n=100 and p=0.1



When n is small, it is easy to compute Binomial probabilities using Eq. 2.33a. For large values of n, it is more convenient to refer to tables which apply not to the PDF but to the corresponding cumulative distribution functions, referred to here as Binomial probability sums, defined as:

$$B(r; n, p) = \sum_{x=0}^{r} B(x; n, p) \qquad (2.33c)$$



There can be numerous combinations of n, p and r, which is a drawback to such tabular determinations. Table A1 in Appendix A illustrates the concept only for n=15 and n=20 and for different values of p and r. Figure 2.10 illustrates how the skewness of the Binomial distribution is affected by p, and by the total number of trials n.

Example 2.4.1: Let k be the number of heads in n=4 independent tosses of a coin. Then the mean of the distribution = (4)·(1/2) = 2, and the variance $\sigma^2$ = (4)·(1/2)·(1-1/2) = 1. From Eq. 2.33a, the probability of two successes in four tosses =

$$B(2; 4, 0.5) = \binom{4}{2} \left(\frac{1}{2}\right)^2 \left(1-\frac{1}{2}\right)^{4-2} = \frac{4 \times 3}{2} \times \frac{1}{4} \times \frac{1}{4} = \frac{3}{8}$$



Example 2.4.2: The probability that a patient recovers from a type of cancer is 0.6. If 15 people are known to have contracted this disease, then one can determine probabilities of various types of cases using Table A1. Let X be the number of people who survive.

(a) The probability that at least 5 survive is:

$$p(X \ge 5) = 1 - p(X < 5) = 1 - \sum_{x=0}^{4} B(x; 15, 0.6) = 1 - 0.0094 = 0.9906$$

(b) The probability that there will be 5 to 8 survivors is:

$$P(5 \le X \le 8) = \sum_{x=0}^{8} B(x; 15, 0.6) - \sum_{x=0}^{4} B(x; 15, 0.6) = 0.3902 - 0.0094 = 0.3808$$

(c) The probability that exactly 5 survive:

$$p(X = 5) = \sum_{x=0}^{5} B(x; 15, 0.6) - \sum_{x=0}^{4} B(x; 15, 0.6) = 0.0338 - 0.0094 = 0.0244$$ ■
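When tables are not at hand, the Binomial probability sums can be evaluated directly. The sketch below is one possible way to do so with the Python standard library; the function names binom_pmf and binom_cdf are illustrative choices, not notation from the text. It evaluates the quantities of Examples 2.4.1 and 2.4.2, and small differences from the tabulated values are expected because the table is rounded to four decimals.

```python
from math import comb


def binom_pmf(x, n, p):
    """B(x; n, p): probability of exactly x successes in n Bernoulli trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binom_cdf(r, n, p):
    """Binomial probability sum B(r; n, p) = P(X <= r)."""
    return sum(binom_pmf(x, n, p) for x in range(r + 1))


# Example 2.4.1: two heads in four tosses of a fair coin.
two_heads_in_four = binom_pmf(2, 4, 0.5)

# Example 2.4.2: n = 15 patients, recovery probability p = 0.6.
n, p = 15, 0.6
at_least_5 = 1 - binom_cdf(4, n, p)
five_to_eight = binom_cdf(8, n, p) - binom_cdf(4, n, p)
exactly_5 = binom_pmf(5, n, p)

print(two_heads_in_four, at_least_5, five_to_eight, exactly_5)
```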



Fig. 2.11 Geometric distribution G(x;0.02) for Example 2.4.3 where the random variable is the number of trials until the event occurs, namely the "50 year design wind" at the coastal location in question. a PDF, b CDF

(c) Geometric Distribution. Rather than considering the number of successful outcomes, there are several physical instances where one would like to ascertain the time interval for a certain probability event to occur the first time (which could very well destroy the physical system). This probability (p) is given by the geometric distribution which can be derived from the Binomial distribution. Consider N to be the random variable representing the number of trials until the event does occur. Note that if an event occurs the first time during the n-th trial then it did not occur during the previous (n-1) trials. Then, the geometric distribution is given by:

$$G(n; p) = p \cdot (1-p)^{n-1}, \qquad n = 1, 2, 3, \ldots \qquad (2.34a)$$



An extension of the above concept relates to the time between two successive occurrences of the same event, called the recurrence time. Since the events are assumed independent, the mean recurrence time, denoted by the random variable T, between two consecutive events is simply the expected value of the geometric distribution:

$$\bar{T} = E(T) = \sum_{t=1}^{\infty} t \cdot p\,(1-p)^{t-1} = p\,[1 + 2(1-p) + 3(1-p)^2 + \cdots] = \frac{1}{p} \qquad (2.34b)$$



Example 2.4.3:¹ Using geometric PDF for 50 year design wind problems

The design code for buildings in a certain coastal region specifies the 50-year wind as the "design wind", i.e., a wind velocity with a return period of 50 years, or one which may be expected to occur once every 50 years. What are the probabilities that:

(a) the design wind is encountered in any given year. From Eq. 2.34b, $p = 1/\bar{T} = 1/50 = 0.02$

(b) the design wind is encountered during the fifth year of a newly constructed building (from Eq. 2.34a):

$$G(5; 0.02) = (0.02) \cdot (1 - 0.02)^4 = 0.018$$

(c) the design wind is encountered within the first 5 years:

$$G(n \le 5; p) = \sum_{t=1}^{5} (0.02) \cdot (1 - 0.02)^{t-1} = 0.02 + 0.0196 + 0.0192 + 0.0188 + 0.0184 = 0.096$$

Figure 2.11 depicts the PDF and the CDF for the geometric function corresponding to this example. ■
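The three parts of the design wind example can be reproduced with a few lines of code. This is a minimal sketch under the example's assumption of independent years with a constant annual probability; the function name geom_pmf and the truncation of the infinite sum at 5000 terms are illustrative choices.

```python
def geom_pmf(n, p):
    """G(n; p): probability that the event first occurs on trial n (n = 1, 2, ...)."""
    return p * (1 - p) ** (n - 1)


return_period = 50          # years
p = 1 / return_period       # annual probability of the design wind

fifth_year = geom_pmf(5, p)
within_five = sum(geom_pmf(n, p) for n in range(1, 6))

# Truncated version of the sum in Eq. 2.34b; it approaches 1/p as terms are added.
mean_recurrence = sum(t * geom_pmf(t, p) for t in range(1, 5001))

print(fifth_year, within_five, mean_recurrence)
```

The cumulative value can also be obtained in closed form as 1 - (1 - p)^5, since the design wind fails to occur in the first five years only if it is absent in each of them.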



(d) Hypergeometric Distribution. The Binomial distribu- 
tion applies in the case of independent trials or when sam- 
pling from a batch of items is done with replacement. Anot- 
her type of dependence arises when sampling is done without 
replacement. This case occurs frequently in areas such as ac- 
ceptance sampling, electronic testing and quality assurance 
where the item is destroyed during the process of testing. If 
n items are to be selected without replacement from a set of 
N items which contain k items that pass a success criterion, 
the PDF of the number X of successful items is given by the 
hypergeometric distribution: 

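$$H(x; N, n, k) = \frac{C(k,x) \cdot C(N-k,\,n-x)}{C(N,n)} = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}}, \qquad x = 0, 1, 2, 3, \ldots \qquad (2.35a)$$

with mean $\mu = \frac{nk}{N}$ and variance

$$\sigma^2 = \frac{N-n}{N-1} \cdot n \cdot \frac{k}{N} \cdot \left(1 - \frac{k}{N}\right) \qquad (2.35b)$$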



Note that C(k, x) is the number of ways x items can be cho- 
sen from the k "successful" set, while C(N-k, n-x) is the 
number of ways that the remainder (n-x) items can be chosen 
from the "unsuccessful" set of (N-k) items. Their product 
divided by the total number of combinations of selecting 
equally likely samples of size n from N items is represented 
by Eq. 2.35a. 

Example 2.4.4: Lots of 10 computers each are called acceptable if they contain no more than 2 defectives. The procedure for sampling the lot is to select 5 computers at random and test for defectives. What is the probability that exactly one defective is found in the sample if there are 2 defectives in the entire lot?

¹ From Ang and Tang (2007) by permission of John Wiley and Sons.

Using the hypergeometric distribution given by Eq. 2.35a with n=5, N=10, k=2 and x=1:

$$H(1; 10, 5, 2) = \frac{\binom{2}{1}\binom{10-2}{5-1}}{\binom{10}{5}} = \frac{(2)(70)}{252} = 0.556$$
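Eq. 2.35a can be evaluated directly from binomial coefficients. The sketch below is a hedged illustration for the lot-sampling setting above (N = 10, n = 5, k = 2); the function name hypergeom_pmf is an arbitrary choice. It also checks that the probabilities over all feasible x sum to one and that the mean agrees with nk/N.

```python
from math import comb


def hypergeom_pmf(x, N, n, k):
    """H(x; N, n, k): x successes in a sample of n drawn without replacement
    from N items of which k are successes (Eq. 2.35a)."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)


N, n, k = 10, 5, 2

# Probabilities of finding 0, 1 or 2 defectives in the sample.
probs = [hypergeom_pmf(x, N, n, k) for x in range(k + 1)]
mean = sum(x * px for x, px in enumerate(probs))

for x, px in enumerate(probs):
    print(f"P(X = {x}) = {px:.4f}")
print(f"mean = {mean:.4f}")
```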



(e) Multinomial Distribution. A logical extension to Bernoulli experiments where the result is a two-way outcome, either success/good or failure/defective, is the multinomial experiment where k possible outcomes are possible. An example of k=5 is when the grade of a student is either A, B, C, D or F. The issue here is to find the number of combinations of n items which can be partitioned into k independent groups (a student can only get a single grade for the same class) with $x_1$ being in the first group, $x_2$ in the second, ... This is represented by:

$$\binom{n}{x_1, x_2, \ldots, x_k} = \frac{n!}{x_1!\,x_2! \cdots x_k!} \qquad (2.36a)$$

with the conditions that $(x_1 + x_2 + \ldots + x_k) = n$ and that all partitions are mutually exclusive and occur with equal probability from one trial to the next. It is intuitively obvious that when n is large and k is small, the hypergeometric distribution will tend to closely approximate the Binomial.

Just like Bernoulli trials lead to the Binomial distribution, the multinomial experiment leads to the multinomial distribution which gives the probability distribution of k random variables $x_1, x_2, \ldots, x_k$ in n independent trials occurring with probabilities $p_1, p_2, \ldots, p_k$:

$$f(x_1, x_2, \ldots, x_k) = \binom{n}{x_1, x_2, \ldots, x_k}\, p_1^{x_1} \cdot p_2^{x_2} \cdots p_k^{x_k} \qquad (2.36b)$$

with $\sum_{i=1}^{k} x_i = n$ and $\sum_{i=1}^{k} p_i = 1$



Example 2.4.5: Consider an examination given to 10 students. The instructor, based on previous years' experience, expects the distribution given in Table 2.6. On grading the exam, he finds that 5 students got an A, 3 got a B and 2 got a C, and no one got either D or F. What is the probability that such an event could have occurred purely by chance?

This answer is directly provided by Eq. 2.36b which yields the corresponding probability of the above event taking place:

$$f(A,B,C,D,F) = \binom{10}{5, 3, 2, 0, 0} (0.2^5) \cdot (0.3^3) \cdot (0.3^2) \cdot (0.1^0) \cdot (0.1^0) = 0.00196$$

This is very low, and hence this occurrence is unlikely to have occurred purely by chance. ■
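The multinomial probability in the grading example is easy to evaluate in code. The following sketch assumes the grade probabilities of Table 2.6 (0.2, 0.3, 0.3, 0.1, 0.1 for A to F); the function name multinomial_pmf is illustrative.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Probability of observing the given counts in sum(counts) independent
    trials with category probabilities probs (Eq. 2.36b)."""
    n = sum(counts)
    coeff = factorial(n)
    for x in counts:
        coeff //= factorial(x)      # multinomial coefficient of Eq. 2.36a
    return coeff * prod(p**x for p, x in zip(probs, counts))


grade_probs = [0.2, 0.3, 0.3, 0.1, 0.1]   # A, B, C, D, F (Table 2.6)
observed = [5, 3, 2, 0, 0]                # grades found in the exam

prob = multinomial_pmf(observed, grade_probs)
print(f"{prob:.5f}")
```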

(f) Poisson Distribution. Poisson experiments are those that involve the number of outcomes of a random variable X which occur per unit time (or space); in other words, as describing the occurrence of isolated events in a continuum. A Poisson experiment is characterized by: (i) independent outcomes (also referred to as memoryless), (ii) probability that a single outcome will occur during a very short time is proportional to the length of the time interval, and (iii) probability that more than one outcome occurs during a very short time is negligible. These conditions lead to the Poisson distribution which is the limit of the Binomial distribution when $n \to \infty$ and $p \to 0$ in such a way that the product $(n \cdot p) = \lambda t$ remains constant. It is given by:

$$P(x; \lambda t) = \frac{(\lambda t)^x \exp(-\lambda t)}{x!}, \qquad x = 0, 1, 2, 3, \ldots \qquad (2.37a)$$



where $\lambda$ is called the "mean occurrence rate", i.e., the average number of occurrences of the event per unit time (or space) interval t. A special feature of this distribution is that its mean or average number of outcomes $\mu$ per time t and its variance $\sigma^2$ are such that

$$\mu(X) = \sigma^2(X) = \lambda t = n \cdot p \qquad (2.37b)$$



Akin to the Binomial distribution, tables for certain combinations of the two parameters allow the cumulative Poisson distribution to be read off directly (see Table A2) with the latter being defined as:

$$P(r; \lambda t) = \sum_{x=0}^{r} P(x; \lambda t) \qquad (2.37c)$$



Applications of the Poisson distribution are widespread: the number of faults in a length of cable, number of suspended particles in a volume of gas, number of cars in a fixed length of roadway or number of cars passing a point in a fixed time interval (traffic flow), counts of $\alpha$-particles in radioactive decay, number of arrivals in an interval of time (queuing theory), the number of noticeable surface defects found by quality inspectors on a new automobile, ...



Table 2.6 PDF of student grades for a class

X       A      B      C      D      F
p(X)    0.2    0.3    0.3    0.1    0.1



Example 2.4.6: During a laboratory experiment, the ave- 
rage number of radioactive particles passing through a coun- 
ter in 1 millisecond is 4. What is the probability that 6 partic- 
les enter the counter in any given millisecond? 






Using the Poisson distribution function (Eq. 2.37a) with x=6 and $\lambda t$=4:

$$P(6; 4) = \frac{4^6 \cdot e^{-4}}{6!} = 0.1042$$

Fig. 2.12 Poisson distribution for the number of storms per year where $\lambda t$=4



Example 2.4.7: The average number of planes landing at an airport each hour is 10 while the maximum number it can handle is 15. What is the probability that in a given hour some planes will have to be put on a holding pattern?

In this case, Eq. 2.37c is used. From Table A2, with $\lambda t$ = 10:

$$P(X > 15) = 1 - P(X \le 15) = 1 - \sum_{x=0}^{15} P(x; 10) = 1 - 0.9513 = 0.0487$$ ■

Example 2.4.8: Using Poisson PDF for assessing storm frequency

Historical records at Phoenix, AZ indicate that on an average there are 4 dust storms per year. Assuming a Poisson distribution, compute the probabilities of the following events using Eq. 2.37a:

(a) that there would not be any storms at all during a year:

$$p(X = 0) = \frac{(4)^0 \cdot e^{-4}}{0!} = 0.018$$

(b) the probability that there will be four storms during a year:

$$p(X = 4) = \frac{(4)^4 \cdot e^{-4}}{4!} = 0.195$$

Note that though the average is four, the probability of actually encountering four storms in a year is less than 20%. Figure 2.12 represents the PDF and CDF for different number of X values for this example. ■
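The Poisson examples above can be reproduced without tables. The sketch below is one possible implementation using only the standard library; the function names poisson_pmf and poisson_cdf are illustrative, and the argument lam_t stands for the product λt of Eq. 2.37a.

```python
from math import exp, factorial


def poisson_pmf(x, lam_t):
    """P(x; lambda*t): probability of exactly x occurrences (Eq. 2.37a)."""
    return lam_t**x * exp(-lam_t) / factorial(x)


def poisson_cdf(r, lam_t):
    """Cumulative Poisson probability P(X <= r) (Eq. 2.37c)."""
    return sum(poisson_pmf(x, lam_t) for x in range(r + 1))


six_particles = poisson_pmf(6, 4)        # Example 2.4.6
holding = 1 - poisson_cdf(15, 10)        # Example 2.4.7
no_storm = poisson_pmf(0, 4)             # Example 2.4.8 (a)
four_storms = poisson_pmf(4, 4)          # Example 2.4.8 (b)

print(six_particles, holding, no_storm, four_storms)
```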



2.4.3 Distributions for Continuous Variables 

(a) Gaussian Distribution. The Gaussian distribution or normal error function is the best known of all continuous distributions. It is a limiting case of the Binomial distribution with the same values of mean and variance but applicable when n is sufficiently large (n>30). It is a two-parameter distribution given by:

$$N(x; \mu, \sigma) = \frac{1}{\sigma (2\pi)^{1/2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \qquad (2.38a)$$



where $\mu$ and $\sigma$ are the mean and standard deviation respectively of the random variable X. Its name stems from an erroneous earlier perception that it was the natural pattern followed by distributions and that any deviation from it required investigation. Nevertheless, it has numerous applications in practice and is the most important of all distributions studied in statistics. Further, it is the parent distribution for several important continuous distributions as can be seen from Fig. 2.9. It is used to model events which occur by chance such as variation of dimensions of mass-produced items during manufacturing, experimental errors, variability in measurable biological characteristics such as people's height or weight, ... Of great practical import is that normal distributions apply in situations where the random variable is the result of a sum of several other variable quantities acting independently on the system.

The shape of the normal distribution is unimodal and symmetrical about the mean, and has its maximum value at $x = \mu$ with points of inflexion at $x = \mu \pm \sigma$. Figure 2.13 illustrates its shape for two different cases of $\mu$ and $\sigma$. Further, the normal distribution given by Eq. 2.38a provides a convenient approximation for computing binomial probabilities for a large number of values (which is tedious), provided $[n \cdot p(1-p)] > 10$.

In problems where the normal distribution is used, it is more convenient to standardize the random variable into a new random variable $z = \frac{x - \mu}{\sigma}$ with mean zero and variance of unity. This results in the standard normal curve or Z-curve:

$$N(z; 0, 1) = \frac{1}{\sqrt{2\pi}} \exp(-z^2/2) \qquad (2.38b)$$

Fig. 2.13 Normal or Gaussian distributions with same mean of 10 but different standard deviations. The distribution flattens out as the standard deviation increases



In actual problems, the standard normal distribution is used to determine the probability of the variate having a value within a certain interval, say z between $z_1$ and $z_2$. Then Eq. 2.38a can be modified into:

$$N(z_1 \le z \le z_2) = \int_{z_1}^{z_2} \frac{1}{\sqrt{2\pi}} \exp(-z^2/2)\,dz \qquad (2.38c)$$



The shaded area in Table A3 permits evaluating the above integral, i.e., determining the associated probability assuming $z_1 = -\infty$. Note that for z = 0, the probability given by the shaded area is equal to 0.5. Since not all texts adopt the same format in which to present these tables, the user is urged to use caution in interpreting the values shown in such tables.

Example 2.4.9: Graphical interpretation of probability 
using the standard normal table 

Resistors made by a certain manufacturer have a nominal value of 100 ohms but their actual values are normally distributed with a mean of $\mu$ = 100.6 ohms and standard deviation $\sigma$ = 3 ohms. Find the percentage of resistors that will have values:

(i) higher than the nominal rating. The standard normal variable z(x=100) = (100-100.6)/3 = -0.2. From Table A3, this corresponds to a probability of (1-0.4207) = 0.5793 or 57.93%.

(ii) within 3 ohms of the nominal rating (i.e., between 97 and 103 ohms). The lower limit $z_1$ = (97-100.6)/3 = -1.2, and the tabulated probability from Table A3 is p(z=-1.2) = 0.1151 (as illustrated in Fig. 2.14a). The upper limit is: $z_2$ = (103-100.6)/3 = 0.8. However, care should be taken in properly reading the corresponding value from Table A3 which only gives probability values of z<0. One first determines the probability about the negative value symmetric about 0, i.e., p(z=-0.8) = 0.2119 (shown in Fig. 2.14b). Since the total area under the curve is 1.0, p(z=0.8) = 1.0-0.2119 = 0.7881. Finally, the required probability p(-1.2<z<0.8) = (0.7881-0.1151) = 0.6730 or 67.3%. ■

Inspection of Table A3 allows the following statements which are important in statistics:

The interval $\mu \pm \sigma$ contains approximately [1-2(0.1587)] = 0.683 or 68.3% of the observations.
The interval $\mu \pm 2\sigma$ contains approximately 95.4% of the observations.
The interval $\mu \pm 3\sigma$ contains approximately 99.7% of the observations.
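The table look-ups in Example 2.4.9 and the three coverage statements above can be cross-checked numerically. The sketch below evaluates the standard normal CDF through the error function, which is a standard identity; the function name std_normal_cdf is an illustrative choice, and small differences from Table A3 reflect the rounding of the table.

```python
from math import erf, sqrt


def std_normal_cdf(z):
    """Area under the standard normal curve from -infinity to z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


# Example 2.4.9: resistors with mean 100.6 ohms and standard deviation 3 ohms.
mu, sigma = 100.6, 3.0
above_nominal = 1 - std_normal_cdf((100 - mu) / sigma)
within_3_ohms = (std_normal_cdf((103 - mu) / sigma)
                 - std_normal_cdf((97 - mu) / sigma))

# Fraction of observations within mu +/- k*sigma for k = 1, 2, 3.
coverage = {k: std_normal_cdf(k) - std_normal_cdf(-k) for k in (1, 2, 3)}

print(above_nominal, within_3_ohms, coverage)
```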

Another manner of using the standard normal table is for 
the "backward" problem. Instead of being specified the z 
value and having to deduce the probability, such a problem 
arises when the probability is specified and the z value is to 
be deduced. 

Fig. 2.14 Figures meant to illustrate that the shaded areas are the physical representations of the tabulated standardized probability values in Table A3. a Lower limit, b Upper limit

Example 2.4.10: Reinforced and pre-stressed concrete structures are designed so that the compressive stresses are carried mostly by the concrete itself. For this and other reasons the main criterion by which the quality of concrete is assessed is its compressive strength. Specifications for concrete used in civil engineering jobs may require specimens of specified size and shape (usually cubes) to be cast and tested on site. One can assume the normal distribution to apply. If the mean and standard deviation of this distribution are $\mu$ and $\sigma$, the civil engineer wishes to determine the "statistical minimum strength" x specified as the strength below which only say 5% of the cubes are expected to fail. One searches Table A3 and determines the value of z for which the probability is 0.05, i.e., p(z=-1.645) = 0.05. Hence, one infers that this would correspond to $x = \mu - 1.645\sigma$.
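The "backward" problem amounts to inverting the normal CDF, which the Python standard library exposes through statistics.NormalDist.inv_cdf (Python 3.8 or later). The sketch below is illustrative only: the mean of 40 and standard deviation of 5 are hypothetical strength values, not data from the text, and the function name statistical_minimum_strength is an arbitrary choice.

```python
from statistics import NormalDist

# z value below which 5% of the standard normal distribution lies.
z_05 = NormalDist().inv_cdf(0.05)


def statistical_minimum_strength(mean, std_dev, fraction_below=0.05):
    """Strength below which the given fraction of specimens is expected to fall,
    assuming normally distributed strengths."""
    return NormalDist(mean, std_dev).inv_cdf(fraction_below)


# Hypothetical concrete cube strengths (units arbitrary, e.g. MPa).
x_min = statistical_minimum_strength(40.0, 5.0)

print(f"z = {z_05:.3f}, statistical minimum strength = {x_min:.2f}")
```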

(b) Student t Distribution. One important application of the normal distribution is that it allows making statistical inferences about population means from random samples (see Sect. 4.2). In case the random samples are small (n<30), then the Student t-distribution, rather than the normal distribution, should be used. If one assumes that the sampled population is approximately normally distributed, then the random variable $t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$ has the Student t-distribution $t(\mu, s, \nu)$ where s is the sample standard deviation and $\nu$ is the degrees of freedom = (n-1). Thus, the number of degrees of freedom (d.f.) equals the number of data points minus the number of constraints or restrictions placed on the data. Table A4 (which is set up differently from the standard normal table) provides numerical values of the t-distribution for different degrees of freedom at different confidence levels. How to use these tables will be discussed in Sect. 4.2. Unlike the z curve, one has a family of t-distributions for different values of $\nu$. Qualitatively, the t-distributions are similar to the standard normal distribution in that they are symmetric about a zero mean, while they are but slightly wider than the corresponding normal distribution as indicated in Fig. 2.15. However, in terms of probability values represented by areas under the curves as in Example 2.4.9, the differences between the normal and the Student t-distributions are large enough to warrant retaining this distinction.

Fig. 2.15 Comparison of the normal (or Gaussian) z curve to two Student-t curves with different degrees of freedom (d.f.). As the d.f. decrease, the PDF for the Student-t distribution flattens out and deviates increasingly from the normal distribution

(c) Lognormal Distribution. This distribution is appropriate for non-negative outcomes which are the product of a number of quantities. In such cases, the data are skewed and the symmetrical normal distribution is no longer appropriate. If a variate X is such that log(X) is normally distributed, then the distribution of X is said to be lognormal. With log(X) ranging from $-\infty$ to $+\infty$, X would range from 0 to $+\infty$. Not only does the lognormal model accommodate skewness, but it also captures the non-negative nature of many variables which occur in practice. It is characterized by two parameters, the mean and variance ($\mu$, $\sigma$), as follows:

$$L(x; \mu, \sigma) = \begin{cases} \dfrac{1}{\sigma\,x\,\sqrt{2\pi}} \exp\left[-\dfrac{(\ln x - \mu)^2}{2\sigma^2}\right] & \text{when } x > 0 \\ 0 & \text{elsewhere} \end{cases} \qquad (2.39)$$

Fig. 2.16 Lognormal distributions for different mean and standard deviation values



The lognormal curves are a family of skewed curves as il- 
lustrated in Fig. 2.16. Lognormal failure laws apply when 
the degradation in lifetime is proportional to the previous 
amount of degradation. Typical applications in civil enginee- 
ring involve flood frequency, in mechanical engineering with 
crack growth and mechanical wear, and in environmental en- 
gineering with pollutants produced by chemical plants and 
threshold values for drug dosage. 

Example 2.4.11: Using lognormal distributions for pollutant concentrations

Concentration of pollutants produced by chemical plants is known to resemble lognormal distributions and is used to evaluate issues regarding compliance of government regulations. The concentration of a certain pollutant, in parts per million (ppm), is assumed lognormal with parameters $\mu$ = 4.6 and $\sigma$ = 1.5. What is the probability that the concentration exceeds 10 ppm?

One can use Eq. 2.39, or simpler still, use the z tables (Table A3) by suitable transformations of the random variable. The standardized value is z = [ln(10) - 4.6]/1.5 = -1.531, for which Table A3 gives N(-1.531) = 0.0630. This is the probability that the concentration is below 10 ppm, so the probability of exceeding 10 ppm is its complement:

$$L(X > 10) = 1 - N\left[\frac{\ln(10) - 4.6}{1.5}\right] = 1 - N(-1.531) = 1 - 0.0630 = 0.9370$$
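The transformation to the standard normal variable can be sketched in a few lines. The code below is a hedged illustration that treats 4.6 and 1.5 as the mean and standard deviation of ln(X), as the example's use of the z table implies. It reports the standard normal CDF at the transformed threshold, which is the probability of a concentration at or below 10 ppm, together with its complement, the probability of exceeding 10 ppm.

```python
from math import log
from statistics import NormalDist

mu_ln, sigma_ln = 4.6, 1.5     # parameters of ln(X), as given in the example
threshold = 10.0               # ppm

z = (log(threshold) - mu_ln) / sigma_ln
below = NormalDist().cdf(z)    # P(X <= 10 ppm)
exceed = 1 - below             # P(X > 10 ppm)

print(f"z = {z:.3f}, P(X <= 10) = {below:.4f}, P(X > 10) = {exceed:.4f}")
```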
(d) Gamma Distribution. There are several processes where distributions other than the normal distribution are warranted. A distribution which is useful since it is versatile in the shapes it can generate is the gamma distribution (also called the Erlang distribution). It is a good candidate for modeling random phenomena which can only be positive and are unimodal. The gamma distribution is derived from the gamma function for positive values of $\alpha$, which one may recall from mathematics, is defined by the integral:

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1} e^{-x}\,dx \qquad (2.40a)$$

Recall that for non-negative integers k:

$$\Gamma(k+1) = k! \qquad (2.40b)$$

The continuous random variable X has a gamma distribution with positive parameters $\alpha$ and $\lambda$ if its density function is given by:

$$G(x; \alpha, \lambda) = \begin{cases} \lambda^{\alpha} e^{-\lambda x}\, \dfrac{x^{\alpha-1}}{(\alpha-1)!} & x > 0 \\ 0 & \text{elsewhere} \end{cases} \qquad (2.40c)$$

The mean and variance of the gamma distribution are:

$$\mu = \alpha/\lambda \quad \text{and} \quad \sigma^2 = \alpha/\lambda^2 \qquad (2.40d)$$

Variation of the parameter $\alpha$ (called the shape factor) and $\lambda$ (called the scale parameter) allows a wide variety of shapes to be generated (see Fig. 2.17). From Fig. 2.9, one notes that the Gamma distribution is the parent distribution of many other distributions discussed below. If $\alpha \to \infty$ and $\lambda = 1$, the gamma distribution approaches the normal (see Fig. 2.9). When $\alpha = 1$, one gets the exponential distribution. When $\alpha = \nu/2$ and $\lambda = 1/2$, one gets the chi-square distribution (discussed below).

(e) Exponential Distribution. A special case of the gamma distribution for $\alpha = 1$ is the exponential distribution. It is the continuous distribution analogue to the geometric distribution which applied to the discrete case. It is used to model the interval between two occurrences, e.g. the distance between consecutive faults in a cable, or the time between chance failures of a component (such as a fuse) or a system, or the time between consecutive emissions of $\alpha$-particles, or the time between successive arrivals at a service facility. Its PDF is given by

$$E(x; \lambda) = \begin{cases} \lambda \cdot e^{-\lambda x} & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2.41a)$$

where $\lambda$ is the mean value per unit time or distance. The mean and variance of the exponential distribution are:

$$\mu = 1/\lambda \quad \text{and} \quad \sigma^2 = 1/\lambda^2 \qquad (2.41b)$$

The distribution is represented by a family of curves for different values of $\lambda$ (see Fig. 2.18). Exponential failure laws apply to products whose current age does not have much effect on their remaining lifetimes. Hence, this distribution is said to be "memoryless". Notice the relationship between the exponential and the Poisson distributions. While the latter represents the number of failures per unit time, the exponential represents the time between successive failures. Its CDF is given by:

$$CDF[E(a, \lambda)] = \int_0^{a} \lambda \cdot e^{-\lambda x}\,dx = 1 - e^{-\lambda a} \qquad (2.41c)$$


Fig. 2.17 Gamma distributions for different combinations of the
shape parameter $\alpha$ and the scale parameter $\beta = 1/\lambda$

Fig. 2.18 Exponential distributions for three different values of the
parameter $\lambda$

Example 2.4.12: Temporary disruptions to the power grid
can occur due to random events such as lightning, transformer
failures, forest fires, ... The Poisson distribution has
been known to be a good function to model such failures.
If these occur, on average, say, once every 2.5 years, then
$\lambda = 1/2.5 = 0.40$ per year.

(a) What is the probability that there will be at least one
disruption next year?

$$\mathrm{CDF}[E(X \le 1;\lambda)] = 1 - e^{-0.4(1)} = 1 - 0.6703 = 0.3297$$

(b) What is the probability that there will be no more than
two disruptions next year?
This is the complement of at least two disruptions.

$$\text{Probability} = 1 - \mathrm{CDF}[E(X \le 2;\lambda)] = 1 - [1 - e^{-0.4(2)}] = 0.4493$$
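The two numbers in Example 2.4.12 can be checked with a few lines of Python. This is only a sketch: the helper names `exponential_cdf` and `poisson_pmf` are illustrative and not from the text. It also evaluates the Poisson probability of at least one event in a year, which should match part (a) because of the Poisson/exponential relationship noted above. The expression used in part (b), one minus the exponential CDF at 2, is the exponential survival probability, i.e. the probability that the waiting time to the next disruption exceeds 2 years.

```python
import math

RATE_PER_YEAR = 0.40  # lambda = 1/2.5 disruptions per year (Example 2.4.12)


def exponential_cdf(t, rate):
    """P(T <= t) for an exponential waiting time T (Eq. 2.41c)."""
    return 1.0 - math.exp(-rate * t)


def poisson_pmf(n, rate, t=1.0):
    """P(N = n) events in an interval of length t for a Poisson process."""
    mean = rate * t
    return math.exp(-mean) * mean**n / math.factorial(n)


p_within_1yr = exponential_cdf(1.0, RATE_PER_YEAR)        # part (a)
p_beyond_2yr = 1.0 - exponential_cdf(2.0, RATE_PER_YEAR)  # expression of part (b)
p_at_least_one = 1.0 - poisson_pmf(0, RATE_PER_YEAR)      # Poisson view of part (a)

print(round(p_within_1yr, 4), round(p_beyond_2yr, 4), round(p_at_least_one, 4))
# prints: 0.3297 0.4493 0.3297
```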



(f) Weibull Distribution. Another versatile and widely used
distribution is the Weibull distribution which is used in
applications involving reliability and life testing; for example,
to model the time of failure or life of a component. The
continuous random variable X has a Weibull distribution with
parameters $\alpha$ and $\beta$ (shape and scale factors respectively) if
its density function is given by:

$$W(x;\alpha,\beta) = \begin{cases} \dfrac{\alpha}{\beta^{\alpha}}\, x^{\alpha-1} \exp\left[-(x/\beta)^{\alpha}\right] & \text{for } x > 0 \\ 0 & \text{elsewhere} \end{cases} \tag{2.42a}$$

with mean

$$\mu = \beta\,\Gamma\!\left(1 + \frac{1}{\alpha}\right) \tag{2.42b}$$

Figure 2.19 shows the versatility of this distribution for
different sets of $\alpha$ and $\beta$ values. Also shown is the special case
of W(1,1) which is the exponential distribution. For $\beta > 1$, the
curves become close to bell-shaped and somewhat resemble
the normal distribution. The Weibull distribution has been
found to be very appropriate to model reliability of a system
i.e., the failure time of the weakest component of a system
(bearing, pipe joint failure, ...).

Example 2.4.13: Modeling wind distributions using the
Weibull distribution

The Weibull distribution is also widely used to model the
hourly variability of wind velocity in numerous locations
worldwide. The mean wind speed and its distribution on an
annual basis, which are affected by local climate conditions,
terrain and height of the tower, are important in order
to determine annual power output from a wind turbine of a
certain design whose efficiency changes with wind speed. It
has been found that the shape factor $\alpha$ varies between 1 and
3 (when $\alpha = 2$, the distribution is called the Rayleigh
distribution). The probability distribution shown in Fig. 2.20 has a
mean wind speed of 7 m/s. Determine:

(a) the numerical value of the parameter $\beta$ assuming the
shape factor $\alpha = 2$

One calculates the gamma function $\Gamma(1 + \tfrac{1}{2}) = 0.8862$,
from which $\beta = 7/0.8862 = 7.9$

(b) using the PDF given by Eq. 2.42, it is left to the reader
to compute the probability of the wind speed being
equal to 10 m/s (and verify the solution against the figure
which indicates a value of 0.064). ■
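Both parts of Example 2.4.13 can be reproduced numerically. The sketch below assumes the Weibull PDF and mean in the form of Eqs. 2.42a and 2.42b; the function names are illustrative only.

```python
import math


def weibull_pdf(x, shape, scale):
    """Weibull density of Eq. 2.42a; zero for x <= 0."""
    if x <= 0:
        return 0.0
    return (shape / scale**shape) * x ** (shape - 1) * math.exp(-((x / scale) ** shape))


def weibull_scale_from_mean(mean, shape):
    """Invert Eq. 2.42b: mean = scale * Gamma(1 + 1/shape)."""
    return mean / math.gamma(1.0 + 1.0 / shape)


shape = 2.0                                   # Rayleigh case of the example
scale = weibull_scale_from_mean(7.0, shape)   # part (a): about 7.9
pdf_at_10 = weibull_pdf(10.0, shape, scale)   # part (b): about 0.064

print(round(scale, 1), round(pdf_at_10, 3))
```

The second value is a probability density (per m/s), which is what Fig. 2.20 plots at a wind speed of 10 m/s.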

Fig. 2.19 Weibull distributions for different values of the two
parameters $\alpha$ and $\beta$ (the shape and scale factors respectively)

Fig. 2.20 PDF of the Weibull distribution W(2, 7.9)

(g) Chi-square Distribution. A third special case of the
gamma distribution is when $\alpha = \nu/2$ and $\lambda = 1/2$ where $\nu$
is a positive integer, and is called the degrees of freedom.
This distribution called the chi-square ($\chi^2$) distribution plays
an important role in inferential statistics where it is used as
a test of significance for hypothesis testing and analysis of
variance type of problems. Just like the t-statistic, there is a
family of distributions for different values of $\nu$ (Fig. 2.21).
Note that the distribution cannot assume negative values,
and that it is positively skewed. Table A5 assembles critical
values of the Chi-square distribution for different values of
the degrees of freedom parameter $\nu$ and for different
significance levels. The usefulness of these tables will be discussed
in Sect. 4.2.

The PDF of the chi-square distribution is:

$$\chi^2(x;\nu) = \begin{cases} \dfrac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{\nu/2-1} e^{-x/2} & x > 0 \\ 0 & \text{elsewhere} \end{cases} \tag{2.43a}$$

while the mean and variance values are:

$$\mu = \nu \quad \text{and} \quad \sigma^2 = 2\nu \tag{2.43b}$$

Fig. 2.21 Chi-square distributions for different values of the variable $\nu$
denoting the degrees of freedom

(h) F-Distribution. While the t-distribution allows
comparison between two sample means, the F distribution
allows comparison between two or more sample variances. It
is defined as the ratio of two independent chi-square random
variables, each divided by its degrees of freedom. The
F distribution is also represented by a family of plots (see
Fig. 2.22) where each plot is specific to a set of numbers
representing the degrees of freedom of the two random variables
($\nu_1$, $\nu_2$). Table A6 assembles critical values of the
F-distributions for different combinations of these two parameters,
and its use will be discussed in Sect. 4.2.

Fig. 2.22 Typical F distributions for two different combinations of the
random variables ($\nu_1$ and $\nu_2$)

(i) Uniform Distribution. The uniform probability distribution
is the simplest of all PDFs and applies to both continuous
and discrete data whose outcomes are all equally likely,
i.e. have equal probabilities. Flipping a coin for heads/tails
or rolling a die for getting numbers between 1 and 6 are
examples which come readily to mind. The probability density
function for the discrete case where X can assume values
$x_1, x_2, \ldots, x_k$ is given by:

$$U(x;k) = \frac{1}{k} \tag{2.44a}$$

with mean and variance

$$\mu = \frac{1}{k}\sum_{i=1}^{k} x_i \quad \text{and} \quad \sigma^2 = \frac{1}{k}\sum_{i=1}^{k} (x_i - \mu)^2 \tag{2.44b}$$

For random variables that are continuous over an interval
(c,d) as shown in Fig. 2.23, the PDF is given by:

$$U(x) = \begin{cases} \dfrac{1}{d-c} & \text{when } c < x < d \\ 0 & \text{otherwise} \end{cases} \tag{2.44c}$$

The mean and variance of the uniform distribution (using
notation shown in Fig. 2.23) are given by:

$$\mu = \frac{c+d}{2} \quad \text{and} \quad \sigma^2 = \frac{(d-c)^2}{12} \tag{2.44d}$$

The probability of random variable X being between say $x_1$
and $x_2$ is:

$$U(x_1 \le X \le x_2) = \frac{x_2 - x_1}{d - c}$$

Fig. 2.23 The uniform distribution assumed continuous over the
interval [c, d]
Example 2.4.14: A random variable X has a uniform
distribution with c = -5 and d = 10 (see Fig. 2.23). Determine:

(a) On an average, what proportion will have a negative
value? (Answer: 1/3)

(b) On an average, what proportion will fall between -2 and
2? (Answer: 4/15) ■
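The answers to Example 2.4.14 follow directly from the interval-length ratio of the continuous uniform distribution. A minimal sketch using exact fractions is shown below; the helper name `uniform_prob` is illustrative, and integer bounds are assumed so that `Fraction` can be used.

```python
from fractions import Fraction


def uniform_prob(x1, x2, c, d):
    """P(x1 <= X <= x2) for X uniform on [c, d], clipped to the support."""
    lo, hi = max(x1, c), min(x2, d)
    if hi <= lo:
        return Fraction(0)
    return Fraction(hi - lo, d - c)


c, d = -5, 10
p_negative = uniform_prob(c, 0, c, d)    # part (a)
p_between = uniform_prob(-2, 2, c, d)    # part (b)
mean = Fraction(c + d, 2)                # (c + d) / 2
variance = Fraction((d - c) ** 2, 12)    # (d - c)^2 / 12

print(p_negative, p_between, mean, variance)
# prints: 1/3 4/15 5/2 75/4
```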

(j) Beta Distribution. A very versatile distribution is the
Beta distribution which is appropriate for continuous random
variables between 0 and 1 such as representing proportions.
It is a two parameter model which is given by:

$$\mathrm{Beta}(x;p,q) = \frac{(p+q-1)!}{(p-1)!\,(q-1)!}\; x^{p-1}(1-x)^{q-1} \tag{2.45a}$$

Depending on the values of p and q, one can model a wide
variety of curves from U-shaped ones to skewed distributions
(see Fig. 2.24). The distributions are symmetrical when
p and q are equal, with the curves becoming peakier as the
numerical values of the two parameters increase. Skewed
distributions are obtained when the parameters are unequal.

The mean and variance of the Beta distribution are:

$$\mu = \frac{p}{p+q} \quad \text{and} \quad \sigma^2 = \frac{pq}{(p+q)^2(p+q+1)} \tag{2.45b}$$

This distribution originates from the Binomial distribution,
and one can detect the obvious similarity of a two-outcome
affair with specified probabilities. The usefulness of this
distribution will become apparent in Sect. 2.5.3 dealing with the
Bayesian approach to probability problems.
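As a quick numerical check of the Beta model, the sketch below evaluates the density for integer parameters and compares a crude numerical integration against the closed-form mean. It assumes the standard normalizing constant (p+q-1)!/((p-1)!(q-1)!) for integer p and q, which reproduces the prior 20x³(1-x) used later in Example 2.5.7 (p = 4, q = 2). The function names and the midpoint integration rule are illustrative choices, not part of the text.

```python
import math


def beta_pdf(x, p, q):
    """Beta density for integer p, q >= 1, written with factorials."""
    if not 0.0 <= x <= 1.0:
        return 0.0
    const = math.factorial(p + q - 1) / (math.factorial(p - 1) * math.factorial(q - 1))
    return const * x ** (p - 1) * (1.0 - x) ** (q - 1)


def beta_mean_var(p, q):
    """Closed-form mean and variance (Eq. 2.45b)."""
    mean = p / (p + q)
    var = p * q / ((p + q) ** 2 * (p + q + 1))
    return mean, var


def integrate(f, n=2000):
    """Midpoint rule on [0, 1]; adequate for the smooth cases used here."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))


area = integrate(lambda x: beta_pdf(x, 4, 2))              # should be close to 1
mean_numeric = integrate(lambda x: x * beta_pdf(x, 4, 2))  # close to 4/6
mean_exact, var_exact = beta_mean_var(4, 2)

print(round(area, 4), round(mean_numeric, 4), round(mean_exact, 4))
```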



Fig. 2.24 Various shapes assumed by the Beta distribution depending
on the values of the two model parameters

2.5 Bayesian Probability

2.5.1 Bayes' Theorem

It was stated in Sect. 2.1.4 that the Bayesian viewpoint can
enhance the usefulness of the classical frequentist notion
of probability. Its strength lies in the fact that it provides
a framework to include prior information in a two-stage (or
multi-stage) experiment. If one substitutes the term p(A) in
Eq. 2.12 by that given by Eq. 2.11, one gets:

$$p(B/A) = \frac{p(A \cap B)}{p(A \cap B) + p(A \cap \bar{B})} \tag{2.46}$$

Also, one can re-arrange Eq. 2.12 into: $p(A \cap B) = p(A) \cdot p(B/A)$
or $= p(B) \cdot p(A/B)$. This allows expressing
Eq. 2.46 into the following expression referred to as the law
of total probability or Bayes' theorem:

$$p(B/A) = \frac{p(A/B) \cdot p(B)}{p(A/B) \cdot p(B) + p(A/\bar{B}) \cdot p(\bar{B})} \tag{2.47}$$

Footnote: There are several texts which deal with Bayesian statistics;
for example, Bolstad (2004).



Bayes' theorem, superficially, appears to be simply a
restatement of the conditional probability equation given by
Eq. 2.12. The question is why is this reformulation so
insightful or advantageous? First, the probability is now
re-expressed in terms of its disjoint parts $\{B, \bar{B}\}$, and second
the probabilities have been "flipped", i.e., p(B/A) is now
expressed in terms of p(A/B). Consider the two events A and
B. If event A is observed while event B is not, this expression
allows one to infer the "flip" probability, i.e. probability of
occurrence of B from that of the observed event A. In
Bayesian terminology, Eq. 2.47 can be written as:

$$\text{Posterior probability of event } B \text{ given event } A = \frac{(\text{Likelihood of } A \text{ given } B) \cdot (\text{Prior probability of } B)}{\text{Prior probability of } A} \tag{2.48}$$

Thus, the probability p(B) is called the prior probability (or
unconditional probability) since it represents opinion before
any data was collected, while p(B/A) is said to be the
posterior probability which is reflective of the opinion revised in
light of new data. The likelihood is identical to the
conditional probability of A given B i.e., p(A/B).

Equation 2.47 applies to the case when only one of two
events is possible. It can be extended to the case of more than
two events which partition the space S. Consider the case
where one has n events, $B_1, \ldots, B_n$, which are disjoint and make
up the entire sample space. Figure 2.25 shows a sample space
of 4 events. Then, the law of total probability states that
the probability of an event A is the sum of its disjoint parts:

$$p(A) = \sum_{j=1}^{n} p(A \cap B_j) = \sum_{j=1}^{n} p(A/B_j) \cdot p(B_j) \tag{2.49}$$

Then

$$\underbrace{p(B_i/A)}_{\text{posterior probability}} = \frac{p(A \cap B_i)}{p(A)} = \frac{\overbrace{p(A/B_i)}^{\text{likelihood}} \cdot \overbrace{p(B_i)}^{\text{prior}}}{\sum_{j=1}^{n} p(A/B_j) \cdot p(B_j)} \tag{2.50}$$

which is known as Bayes' theorem for multiple events. As
before, the marginal or prior probabilities $p(B_i)$ for i = 1, ..., n
are assumed to be known in advance, and the intention is to
update or revise our "belief" on the basis of the observed
evidence of event A having occurred. This is captured by the
probability $p(B_i/A)$ for i = 1, ..., n called the posterior
probability or the weight one can attach to each event $B_i$ after
event A is known to have occurred.

Fig. 2.25 Bayes theorem for multiple events depicted on a Venn
diagram. In this case, the sample space is assumed to be partitioned into
four discrete events $B_1 \ldots B_4$. If an observable event A has already
occurred, the conditional probability of $B_3$ is
$p(B_3/A) = p(B_3 \cap A)/p(A)$. This is the
ratio of the hatched area to the total area inside the ellipse

Example 2.5.1: Consider the two-stage experiment of
Example 2.2.7. Assume that the experiment has been
performed and that a red marble has been obtained. One can
use the information known beforehand i.e., the prior
probabilities R, W and G to determine which box the marble
came from. Note that the probability of the red marble
having come from box A represented by p(A/R) is now the
conditional probability of the "flip" problem. These are called
the posterior probabilities of event A with R having
occurred, i.e., they are relevant after the experiment has been
performed. Thus, from the law of total probability (Eq. 2.47):

$$p(B/R) = \frac{\frac{1}{2} \cdot \frac{3}{4}}{\frac{1}{2} \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{3}{4}} = \frac{3}{5}$$

and

$$p(A/R) = \frac{\frac{1}{2} \cdot \frac{1}{2}}{\frac{1}{2} \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{3}{4}} = \frac{2}{5}$$

The reverse probability tree for this experiment is shown in
Fig. 2.26. The reader is urged to compare this with the
forward tree diagram of Example 2.2.7. The probabilities of 1.0
for both W and G outcomes imply that there is no uncertainty
at all in predicting where the marble came from. This is
obvious since only Box A contains W, and only Box B contains
G. However, for the red marble, one cannot be sure of its
origin, and this is where a probability measure has to be
determined. ■

Fig. 2.26 The probabilities of the reverse tree diagram at each stage
are indicated. If a red marble (R) is picked, the probabilities that it came
from either Box A or Box B are 2/5 and 3/5 respectively

Fig. 2.27 The forward tree diagram showing the four events
which may result when monitoring the performance of a piece
of equipment

State of equipment   Diagnosis   Probability   Outcome
Fine                 Fine        0.9405        A1: fault-free
Fine                 Faulty      0.0495        A2: false alarm
Faulty               Faulty      0.009         B1: faulty
Faulty               Fine        0.001         B2: missed opportunity
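Bayes' theorem for a partition of events (Eq. 2.50) is short enough to write as a function. The sketch below is illustrative: the function name `posterior` and the dictionary layout are assumptions, and the inputs are the box priors (1/2 each) and the conditional probabilities of drawing a red marble (1/2 from Box A, 3/4 from Box B) that appear in the equations of Example 2.5.1.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Bayes' theorem for a partition B_1..B_n (Eq. 2.50).

    priors[k]      = p(B_k)
    likelihoods[k] = p(A / B_k)
    Returns a dict of p(B_k / A).
    """
    joint = {k: priors[k] * likelihoods[k] for k in priors}
    total = sum(joint.values())  # law of total probability (Eq. 2.49)
    return {k: joint[k] / total for k in joint}


box_priors = {"A": Fraction(1, 2), "B": Fraction(1, 2)}
p_red_given_box = {"A": Fraction(1, 2), "B": Fraction(3, 4)}
post = posterior(box_priors, p_red_given_box)

print(post["A"], post["B"])
# prints: 2/5 3/5
```

The same function works unchanged for any number of disjoint events, provided the priors sum to one.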



Example 2.5.2: Forward and reverse probability trees for
fault detection of equipment

A large piece of equipment is being continuously monitored
by an add-on fault detection system developed by another
vendor in order to detect faulty operation. The vendor of the
fault detection system states that their product correctly
identifies faulty operation when indeed it is faulty (this is
referred to as sensitivity) 90% of the time. This implies that there
is a probability p = 0.10 of a false negative occurring (i.e.,
a missed opportunity of signaling a fault). Also, the vendor
quoted that the correct status prediction rate or specificity of
the detection system (i.e., system identified as healthy when
indeed it is so) is 0.95, implying that the false positive or
false alarm rate is 0.05. Finally, historic data seem to indicate
that the large piece of equipment tends to develop faults only
1% of the time.

Figure 2.27 shows how this problem can be systematically
represented by a forward tree diagram. State A is the
fault-free state and state B is the faulty state. Further,
each of these states can have two outcomes as shown.
While outcomes A1 and B1 represent correctly identified
fault-free and faulty operation, the other two outcomes are
errors arising from an imperfect fault detection system.
Outcome A2 is the false positive (or false alarm, or type I error),
while outcome B2 is the false negative (or missed opportunity,
or type II error); both error types will be discussed at length
in Sect. 4.2 of Chap. 4. The figure clearly illustrates that the
probabilities of A and B occurring along with the conditional
probabilities p(A1/A) = 0.95 and p(B1/B) = 0.90, result in the
probabilities of each of the four states as shown in the figure.

Fig. 2.28 Reverse tree diagram depicting two possibilities. If an alarm
sounds, it could be either an erroneous one (outcome A from A2) or
a valid one (B from B1). Further, if no alarm sounds, there is still the
possibility of missed opportunity (outcome B from B2). The probability
that it is a false alarm is 0.846 which is too high to be acceptable in
practice. How to decrease this is discussed in the text

The reverse tree situation, shown in Fig. 2.28, corresponds
to the following situation. A fault has been signaled. What is
the probability that this is a false alarm? Using Eq. 2.47:

$$p(A/A2) = \frac{(0.99)(0.05)}{(0.99)(0.05) + (0.01)(0.90)} = \frac{0.0495}{0.0495 + 0.009} = 0.846$$



This is very high for practical situations and could well result
in the operator disabling the fault detection system altogether.
One way of reducing this false alarm rate, and thereby
enhance robustness, is to make the detection device less prone
to signaling a fault by altering the detection threshold, i.e., to
raise its specificity above the current 95% at the expense of a
sensitivity lower than the current 90%. This would result in a higher
missed opportunity rate, which one has to accept for the price
of reduced false alarms. For example, the current missed
opportunity rate is:



$$p(B/B1) = \frac{(0.01)(0.10)}{(0.01)(0.10) + (0.99)(0.95)} = \frac{0.001}{0.001 + 0.9405} = 0.001$$



This is probably lower than what is needed, and so the above 
suggested remedy is one which can be considered. Note that 
as the piece of machinery degrades, the percent of time when 
faults are likely to develop will increase from the current 1% 
to something higher. This will have the effect of lowering the 
false alarm rate (left to the reader to convince himself why). ■ 
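The forward and reverse trees of Example 2.5.2 can be reproduced with a small function, which also makes it easy to see how the two reverse probabilities respond to changes in the inputs. This is a sketch only: the function name is illustrative, and the alternative values tried below (a 5% fault prevalence and a 0.99 specificity) are hypothetical, chosen purely to show the direction of the effect.

```python
def reverse_tree(p_fault, sensitivity, specificity):
    """Return (P(fault-free | alarm), P(faulty | no alarm))."""
    p_ok = 1.0 - p_fault
    a1 = p_ok * specificity              # fine, diagnosed fine
    a2 = p_ok * (1.0 - specificity)      # fine, diagnosed faulty (false alarm)
    b1 = p_fault * sensitivity           # faulty, diagnosed faulty
    b2 = p_fault * (1.0 - sensitivity)   # faulty, diagnosed fine (missed)
    false_alarm_given_alarm = a2 / (a2 + b1)
    fault_given_no_alarm = b2 / (b2 + a1)
    return false_alarm_given_alarm, fault_given_no_alarm


base = reverse_tree(p_fault=0.01, sensitivity=0.90, specificity=0.95)
degraded = reverse_tree(p_fault=0.05, sensitivity=0.90, specificity=0.95)  # hypothetical
stricter = reverse_tree(p_fault=0.01, sensitivity=0.90, specificity=0.99)  # hypothetical

print(round(base[0], 3), round(base[1], 3))   # prints: 0.846 0.001
print(round(degraded[0], 3), round(stricter[0], 3))
```

With these illustrative inputs, both a higher fault prevalence and a higher specificity bring the probability that an alarm is false well below 0.846.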

Bayesian statistics provide the formal manner by which 
prior opinion expressed as probabilities can be revised in 
the light of new information (from additional data collec- 
ted) to yield posterior probabilities. When combined with 
the relative consequences or costs of being right or wrong, 
it allows one to address decision-making problems as poin- 
ted out in the example above (and discussed at more length 
in Sect. 12.2.9). It has had some success in engineering (as 
well as in social sciences) where subjective judgment, often 
referred to as intuition or experience gained in the field, is 
relied upon heavily. 

Bayes' theorem is a consequence of the probability
laws and is accepted by all statisticians. It is the
interpretation of probability which is controversial. Both approaches
differ in how probability is defined:

• classical viewpoint: long run relative frequency of an
event

• Bayesian viewpoint: degree of belief held by a person
about some hypothesis, event or uncertain quantity (Phillips
1973).

Advocates of the classical approach argue that human
judgment is fallible while dealing with complex situations,
and this was the reason why formal statistical procedures
were developed in the first place. Introducing the vagueness
of human judgment as done in Bayesian statistics would
dilute the "purity" of the entire mathematical approach.
Advocates of the Bayesian approach, on the other hand, argue
that the "personalist" definition of probability should not be
interpreted as the "subjective" view. Granted that the prior
probability is subjective and varies from one individual to
the other, but with additional data collection all these views
get progressively closer. Thus, with enough data, the initial
divergent opinions would become indistinguishable. Hence,
they argue, the Bayesian method brings consistency to
informal thinking when complemented with collected data, and
should, thus, be viewed as a mathematically valid approach.



2.5.2 Application to Discrete Probability 
Variables 

The following example illustrates how the Bayesian appro- 
ach can be applied to discrete data. 

Example 2.5.3: Using the Bayesian approach to enhance
value of concrete piles testing

Concrete piles driven in the ground are used to provide
bearing strength to the foundation of a structure (building,
bridge, ...). Hundreds of such piles could be used in large
construction projects. These piles could develop defects such
as cracks or voids in the concrete which would lower
compressive strength. Tests are performed by engineers on piles
selected at random during the concrete pour process in order
to assess overall foundation strength. Let the random discrete
variable be the proportion of defective piles out of the entire
lot which is taken to assume five discrete values as shown in
the first column of Table 2.7. Consider the case where the
prior experience of an engineer as to the proportion of
defective piles from similar sites is given in the second column of
the table below.

Before any testing is done, the expected value of
the probability of finding one pile to be defective is:

$$p = (0.20)(0.30) + (0.4)(0.40) + (0.6)(0.15) + (0.8)(0.10) + (1.0)(0.05) = 0.44$$

(as shown in the last row under the second column).
This is the prior probability.

Table 2.7 Illustration of how a prior PDF is revised with new data

Proportion of      Probability of being defective
defectives (x)     Prior PDF of   After one pile    After two piles    Limiting case
                   defectives     tested is found   tested are found   of infinite
                                  defective         defective          defectives
0.2                0.30           0.136             0.049              0.0
0.4                0.40           0.364             0.262              0.0
0.6                0.15           0.204             0.221              0.0
0.8                0.10           0.182             0.262              0.0
1.0                0.05           0.114             0.205              1.0
Expected
probability of
defective pile     0.44           0.55              0.66               1.0

From Ang and Tang (2007) by permission of John Wiley and Sons.

Fig. 2.29 Illustration of how the prior discrete PDF is affected by data
collection following Bayes' theorem (four panels of PDF versus
proportion of defectives: Prior To Testing; After Failure of First Pile
Tested; After Failure of Two Successive Piles Tested; Limiting Case of
All Tested Piles Failing)

Suppose the first pile tested is found to be defective. How
should the engineer revise his prior probability of the
proportion of piles likely to be defective? This is given by
Bayes' theorem (Eq. 2.50). For proportion x = 0.2, the posterior
probability is:

$$p(x = 0.2) = \frac{(0.2)(0.3)}{(0.2)(0.3) + (0.4)(0.4) + (0.6)(0.15) + (0.8)(0.10) + (1.0)(0.05)} = \frac{0.06}{0.44} = 0.136$$

This is the value which appears in the first row under the third
column. Similarly the posterior probabilities for different
values of x can be determined as well as the expected value
E(x) which is 0.55. Hence, a single inspection has led to the
engineer revising his prior opinion upward from 0.44 to 0.55.
Had he drawn a conclusion on just this single test without
using his prior judgment, he would have concluded that all the
piles were defective; clearly, an over-statement. The engineer
would probably get a second pile tested, and if it also turns
out to be defective, the associated probabilities are shown in
the fourth column of Table 2.7. For example, for x = 0.2:

$$p(x = 0.2) = \frac{(0.2)(0.136)}{(0.2)(0.136) + (0.4)(0.364) + (0.6)(0.204) + (0.8)(0.182) + (1.0)(0.114)} = 0.049$$
Table 2.8 Prior pdf of defective proportion

x       0.1    0.2
f(x)    0.6    0.4

The expected value in this case increases to 0.66. In the limit,
if each successive pile tested turns out to be defective, one
gets back the classical distribution, listed in the last column
of the table. The progression of the PDF from the prior to the
infinite case is illustrated in Fig. 2.29. Note that as more piles
tested turn out to be defective, the evidence from the data
gradually overwhelms the prior judgment of the engineer.
However, it is only when collecting data is so expensive or
time consuming that decisions have to be made from limited
data that the power of the Bayesian approach becomes
evident. Of course, if one engineer's prior judgment is worse
than that of another engineer, then his conclusion from the
same data would be poorer than that of the other engineer. It is
this type of subjective disparity which antagonists of the
Bayesian approach are uncomfortable with. On the other hand,
proponents of the Bayesian approach would argue that
experience (even if intangible) gained in the field is a critical asset
in engineering applications and that discarding this type of
knowledge entirely is naive, and a severe handicap. ■
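The successive columns of Table 2.7 can be regenerated by applying Bayes' theorem once per defective pile. The following sketch uses the prior of the example; the function names are illustrative, and the choice of 50 consecutive defective piles to mimic the limiting case is arbitrary.

```python
proportions = [0.2, 0.4, 0.6, 0.8, 1.0]   # candidate proportions of defectives
prior = [0.30, 0.40, 0.15, 0.10, 0.05]    # engineer's prior PDF (Table 2.7)


def update_after_defective(pdf):
    """One Bayes update after a tested pile is found defective.

    The likelihood of a defective pile, given proportion x, is x itself.
    """
    joint = [x * p for x, p in zip(proportions, pdf)]
    total = sum(joint)
    return [j / total for j in joint]


def expected(pdf):
    """Expected probability of a pile being defective."""
    return sum(x * p for x, p in zip(proportions, pdf))


after_one = update_after_defective(prior)
after_two = update_after_defective(after_one)

limit = prior
for _ in range(50):                        # many consecutive defective piles
    limit = update_after_defective(limit)

print([round(p, 3) for p in after_one], round(expected(after_one), 2))
print([round(p, 3) for p in after_two], round(expected(after_two), 2))
print(round(limit[-1], 3))
```

The first two printed rows should agree with the third and fourth columns of Table 2.7, and the last value approaches 1.0 as in the limiting column.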

There are instances when no previous knowledge or
information is available about the behavior of the random variable;
this is sometimes referred to as a prior of pure ignorance. It
can be shown that this assumption of the prior leads to results
identical to those of the traditional probability approach (see
Examples 2.5.5 and 2.5.6).

Example 2.5.4: Consider a machine whose prior pdf of the
proportion x of defectives is given by Table 2.8.
If a random sample of size 2 is selected, and one defective
is found, the Bayes estimate of the proportion of defectives
produced by the machine is determined as follows.

Let y be the number of defectives in the sample. The
probability that the random sample of size 2 yields one defective
is given by the Binomial distribution since this is a
two-outcome situation:

$$f(y/x) = B(y;n,x) = \binom{2}{y} x^{y}(1-x)^{2-y}; \quad y = 0, 1, 2$$

If x = 0.1, then

$$f(1/0.1) = B(1;2,0.1) = \binom{2}{1}(0.1)^{1}(0.9)^{2-1} = 0.18$$

Similarly, for x = 0.2, f(1/0.2) = 0.32.

Thus, the total probability of finding one defective in a
sample size of 2 is:

$$f(y = 1) = (0.18)(0.6) + (0.32)(0.40) = 0.108 + 0.128 = 0.236$$

The posterior probability f(x/y = 1) is then given:

• for x = 0.1: 0.108/0.236 = 0.458

• for x = 0.2: 0.128/0.236 = 0.542

Finally, the Bayes' estimate of the proportion of defectives
x is:

$$\hat{x} = (0.1)(0.458) + (0.2)(0.542) = 0.1542$$

which is quite different from the value of 0.5 given by the
classical method. ■

From Walpole et al. (2007) by permission of Pearson Education.

2.5.3 Application to Continuous Probability
Variables

Bayes' theorem can also be extended to the case of
continuous random variables (Ang and Tang 2007). Let x
be the random variable with a prior PDF denoted by p(x).
Though any appropriate distribution can be chosen, the
Beta distribution (given by Eq. 2.45) is particularly
convenient and is widely used to characterize prior PDF.
Another commonly used prior is the uniform distribution called
a diffuse prior.

Footnote: The Beta distribution is convenient because of the
corresponding mathematical simplicity which it provides as well as
the ability to capture a wide variety of PDF shapes.

For consistency with convention, a slightly different
nomenclature than that of Eq. 2.50 is adopted. Assuming the
Beta distribution, Eq. 2.45a can be rewritten to yield the prior:

$$p(x) \propto x^{a}(1-x)^{b} \tag{2.51}$$

Recall that the higher the values of the exponents a and b, the
peakier the distribution, indicative of the prior distribution
being relatively well defined.

Let L(x) represent the conditional probability or
likelihood function of observing y "successes" out of n
observations. Then, the posterior probability is given by:

$$f(x/y) \propto L(x) \cdot p(x) \tag{2.52}$$

In the context of Fig. 2.25, the likelihood of the unobservable
events $B_1 \ldots B_n$ is the conditional probability that A has
occurred given $B_i$ for i = 1, ..., n, or by $p(A/B_i)$. The likelihood
function can be gleaned from probability considerations in
many cases. Consider Example 2.5.3 involving testing the
foundation piles of buildings. The Binomial distribution
gives the probability of x failures in n independent Bernoulli
trials, provided the trials are independent and the probability
of failure in any one trial is p. This applies to the case when
one holds p constant and studies the behavior of the pdf of
defectives x. If instead, one holds x constant and lets p(x)
vary over its possible values, one gets the likelihood
function. Suppose n piles are tested and y piles are found to be
defective or sub-par. In this case, the likelihood function is
written as follows for the Binomial PDF:

$$L(x) = \binom{n}{y} x^{y}(1-x)^{n-y}, \quad 0 \le x \le 1 \tag{2.53}$$

Notice that the Beta distribution is the same form as the
likelihood function. Consequently, the posterior distribution
given by Eq. 2.52 assumes the form:

$$f(x/y) = k \cdot x^{a+y}(1-x)^{b+n-y} \tag{2.54}$$

where k is independent of x and is a normalization constant.
Note that (1/k) plays the role of the denominator term in Bayes'
theorem (Eq. 2.50) and is essentially a constant introduced to
satisfy the probability law that the area under the PDF is unity.
What is interesting is that the information contained in the prior
has the net result of "artificially" augmenting the number of
observations taken. While the classical approach would use the
likelihood function with exponents y and (n-y) (see Eq. 2.53),
these are inflated to (a+y) and (b+n-y) in Eq. 2.54 for the
posterior distribution. This is akin to having taken more
observations, and supports the previous statement that the Bayesian
approach is particularly advantageous when the number of
observations is low.

Three examples illustrating the use of Eq. 2.54 are given
below.

Example 2.5.5: Repeat Example 2.5.4 assuming that no
information is known about the prior. In this case, assume a
uniform distribution.
The likelihood function is found from the Binomial distribution:

$$f(y/x) = B(1;2,x) = \binom{2}{1} x^{1}(1-x)^{2-1} = 2x(1-x)$$

The total probability of one defective is now given by:

$$f(y = 1) = \int_0^1 2x(1-x)\,dx = \frac{1}{3}$$

The posterior probability is then found by dividing the above
two expressions (Eq. 2.54):

$$f(x/y = 1) = \frac{2x(1-x)}{1/3} = 6x(1-x)$$

Finally, the Bayes' estimate of the proportion of defectives
x is:

$$\hat{x} = 6\int_0^1 x^{2}(1-x)\,dx = 0.5$$

which can be compared to the value of 0.5 given by the
classical method. ■
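Examples 2.5.4 and 2.5.5 differ only in the prior, so they can be compared side by side. The sketch below computes the Bayes estimate for the two-point prior of Table 2.8 and for a uniform prior on [0, 1]. The function names are illustrative, and the integrals for the uniform case are approximated with a simple midpoint rule, which is an implementation choice rather than anything prescribed by the text.

```python
import math


def binom_pmf(y, n, x):
    """Binomial probability of y defectives in a sample of n."""
    return math.comb(n, y) * x**y * (1.0 - x) ** (n - y)


def discrete_bayes_estimate(xs, prior, y, n):
    """Posterior over the discrete values xs and the resulting Bayes estimate."""
    joint = [binom_pmf(y, n, x) * p for x, p in zip(xs, prior)]
    total = sum(joint)                      # total probability f(y)
    post = [j / total for j in joint]
    return total, post, sum(x * w for x, w in zip(xs, post))


def uniform_prior_bayes_estimate(y, n, steps=2000):
    """Same calculation with a uniform prior on [0, 1] (midpoint rule)."""
    h = 1.0 / steps
    mids = [(i + 0.5) * h for i in range(steps)]
    like = [binom_pmf(y, n, x) for x in mids]
    total = h * sum(like)                   # total probability f(y)
    estimate = h * sum(x * l for x, l in zip(mids, like)) / total
    return total, estimate


f_y_disc, post_disc, xhat_disc = discrete_bayes_estimate([0.1, 0.2], [0.6, 0.4], y=1, n=2)
f_y_unif, xhat_unif = uniform_prior_bayes_estimate(y=1, n=2)

print(round(f_y_disc, 3), [round(p, 3) for p in post_disc], round(xhat_disc, 4))
print(round(f_y_unif, 4), round(xhat_unif, 4))
```

With the informative two-point prior the estimate stays near 0.15, whereas the uniform prior returns the classical value of 0.5.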



Example 2.5.6: Let us consider the same situation as that
treated in Example 2.5.3. However, the proportion of
defectives x is now a continuous random variable for which no prior
distribution can be assigned. This implies that the engineer
has no prior information, and in such cases, a uniform
distribution is assumed:

$$p(x) = 1.0 \quad \text{for } 0 \le x \le 1$$

The likelihood function for the case of the single tested pile
turning out to be defective is x, i.e. L(x) = x. The posterior
distribution is then:

$$f(x/y) = k \cdot x \cdot (1.0)$$

The normalizing constant

$$k = \left[\int_0^1 x\,dx\right]^{-1} = 2$$

Hence, the posterior probability distribution is:

$$f(x/y) = 2x \quad \text{for } 0 \le x \le 1$$

The Bayesian estimate of the proportion of defectives is:

$$p = E(x/y) = \int_0^1 x \cdot 2x\,dx = 0.667$$
Example 2.5.7: Enhancing historical records of wind
velocity using the Bayesian approach

Buildings are designed to withstand a maximum wind speed
which depends on the location. The probability x that the
wind speed will not exceed 120 km/h more than once in 5
years is to be determined. Past records of wind speeds of a
nearby location indicated that the following beta distribution
would be an acceptable prior for the probability distribution
(Eq. 2.45):

$$p(x) = 20x^{3}(1-x) \quad \text{for } 0 \le x \le 1$$

In this case, the likelihood that the annual maximum wind
speed will exceed 120 km/h in 1 out of 5 years is given by:

$$L(x) = \binom{5}{4} x^{4}(1-x) = 5x^{4}(1-x)$$

Hence, the posterior probability is deduced following
Eq. 2.54:

$$f(x/y) = k \cdot [5x^{4}(1-x)] \cdot [20x^{3}(1-x)] = 100k \cdot x^{7} \cdot (1-x)^{2}$$

where the constant k can be found from the normalization
criterion:

$$k = \left[\int_0^1 100x^{7}(1-x)^{2}\,dx\right]^{-1} = 3.6$$

Finally, the posterior PDF is given by

$$f(x/y) = 360x^{7}(1-x)^{2} \quad \text{for } 0 \le x \le 1$$

Plots of the prior, likelihood and the posterior functions
are shown in Fig. 2.30. Notice how the posterior distribution
has become more peaked, reflective of the fact that the
single test data has provided the analyst with more
information than that contained in either the prior or the likelihood
function. ■

From Ang and Tang (2007) by permission of John Wiley and Sons.

Fig. 2.30 Probability distributions of the prior, likelihood function and
the posterior. (From Ang and Tang 2007 by permission of John Wiley
and Sons)
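Because the Beta prior and the Binomial likelihood have the same functional form, the posterior of Eq. 2.54 is obtained by simply adding exponents, and its normalizing constant follows from factorials when the exponents are integers. The sketch below applies this to Examples 2.5.5 to 2.5.7. The function names are illustrative, and the mean and variance use Eq. 2.45b with p = r + 1 and q = s + 1 for a density proportional to x^r (1-x)^s, which is an assumption about how the two parameterizations map onto each other.

```python
from fractions import Fraction
from math import factorial


def posterior_exponents(a, b, y, n):
    """Prior x^a (1-x)^b and y successes in n trials give x^(a+y) (1-x)^(b+n-y)."""
    return a + y, b + n - y


def normalizing_constant(r, s):
    """Constant c such that c * x^r * (1-x)^s integrates to 1 on [0, 1]."""
    return Fraction(factorial(r + s + 1), factorial(r) * factorial(s))


def mean_and_variance(r, s):
    """Mean and variance of the density proportional to x^r (1-x)^s."""
    p, q = r + 1, s + 1
    return Fraction(p, p + q), Fraction(p * q, (p + q) ** 2 * (p + q + 1))


# Example 2.5.7: prior 20 x^3 (1-x), likelihood 5 x^4 (1-x), i.e. y = 4, n = 5
r, s = posterior_exponents(a=3, b=1, y=4, n=5)
c_post = normalizing_constant(r, s)          # 360, i.e. 100 * k with k = 3.6
prior_mean, prior_var = mean_and_variance(3, 1)
post_mean, post_var = mean_and_variance(r, s)

# Examples 2.5.5 and 2.5.6: uniform prior (a = b = 0)
c_255 = normalizing_constant(*posterior_exponents(0, 0, y=1, n=2))   # 6
c_256 = normalizing_constant(*posterior_exponents(0, 0, y=1, n=1))   # 2

print((r, s), c_post, float(prior_var), float(post_var))
print(c_255, c_256, mean_and_variance(1, 0)[0])
```

The posterior variance comes out smaller than the prior variance, which is the numerical counterpart of the more peaked posterior curve described for Fig. 2.30.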



2.6 Probability Concepts and Statistics 

The distinction between probability and statistics is often not
clear cut, and sometimes, the terminology adds to the confusion*.
In its simplest sense, probability generally allows one to predict
the behavior of the system "before" the event under the stipulated
assumptions, while statistics refers to a body of knowledge whose
application allows one to make sense out of the data collected.
Thus, probability concepts provide the theoretical underpinnings
of those aspects of statistical analysis which involve random
behavior or noise in the actual data being analyzed. Recall that
in Sect. 1.5, a distinction had been made between four types of
uncertainty or unexpected variability in the data. The first was
due to the stochastic or inherently random nature of the process
itself which no amount of experiment, even if done perfectly, can
overcome. The study of probability theory is mainly mathematical,
and applies to this type, i.e., to situations/processes/systems
whose random nature is known to be of a certain type or can be
modeled so that its behavior (i.e., certain events being produced
by the system) can be predicted in the form of probability
distributions. Thus, probability deals with the idealized behavior
of a system under a known type of randomness. Unfortunately, most
natural or engineered systems do not fit neatly into any one of
these groups, and so when performance data is available of a
system, the objective may be:

(i) to try to understand the overall nature of the system
from its measured performance, i.e., to explain what
caused the system to behave in the manner it did, and
(ii) to try to make inferences about the general behavior of
the system from a limited amount of data.

* For example, "statistical mechanics" in physics has nothing to do with
statistics at all but is a type of problem studied under probability.

Consequently, some authors have suggested that probability
be viewed as a "deductive" science where the conclusion
is drawn without any uncertainty, while statistics is an
"inductive" science where only an imperfect conclusion can be
reached, with the added problem that this conclusion hinges
on the types of assumptions one makes about the random
nature of the underlying drivers! Here is a simple example
to illustrate the difference. Consider the flipping of a coin
supposed to be fair. The probability of getting "heads" is 1/2.
If, however, "heads" come up 8 times out of the last 10 trials,
what is the probability the coin is not fair? Statistics allows
an answer to this type of enquiry, while probability is the
approach for the "forward" type of questioning.

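The "forward" half of the coin example is easy to make concrete. The sketch below assumes independent flips and uses the binomial distribution to compute how likely 8 or more heads in 10 flips would be if the coin really were fair. This tail probability is one common ingredient of a statistical answer; it is not by itself the probability that the coin is unfair. The function name `binomial_tail` is a hypothetical helper introduced for illustration.

```python
from math import comb


def binomial_tail(n: int, k_min: int, p: float) -> float:
    """Probability of at least k_min successes in n independent trials."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Forward (probability) question: the coin is assumed fair, p = 1/2.
p_exactly_8 = comb(10, 8) * 0.5**10
p_at_least_8 = binomial_tail(10, 8, 0.5)

print(f"P(exactly 8 heads in 10)  = {p_exactly_8:.4f}")
print(f"P(at least 8 heads in 10) = {p_at_least_8:.4f}")
```

A fair coin produces 8 or more heads in 10 flips in roughly 5% of such experiments, so the observation is unusual but far from impossible; deciding what to conclude from it is the inductive, statistical step.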
The previous sections in this chapter presented basic notions
of classical probability and how the Bayesian viewpoint
is appropriate for certain types of problems. Both these
viewpoints are still associated with the concept of probability
as the relative frequency of an occurrence. In a broader
context, one should distinguish between three kinds of
probabilities:

(i) Objective or absolute probability which is the classical
one where it is interpreted as the "long run frequency".
This is the same for everyone (provided the calculation is
done correctly!). It is an informed guess of an event which
in its simplest form is a constant; for example, historical
records yield the probability of flood occurring this year
or of the infant mortality rate in the U.S.

Table 2.9 assembles probability estimates for the occurrence
of natural disasters with 10 and 1000 fatalities per event
(indicative of the severity level) during different time
spans (1, 10 and 20 years). Note that floods and tornadoes
have relatively small return times for small events while
earthquakes and hurricanes have relatively short times for
large events. Such probability considerations can be
determined at a finer geographical scale, and these play a
key role in the development of codes and standards for
designing large infrastructures (such as dams) as well as
small systems (such as residential buildings).

Table 2.9 Estimates of absolute probabilities for different natural
disasters in the United States. (Adapted from Barton and Nishenko 2008)

              10 fatalities per event                  1000 fatalities per event
              Exposure times               Return      Exposure times               Return
Disaster      1 year   10 years  20 years  time (yrs)  1 year   10 years  20 years  time (yrs)
Earthquakes   0.11     0.67      0.89      9           0.01     0.14      0.26      67
Hurricanes    0.39     0.99      >0.99     2           0.06     0.46      0.71      16
Floods        0.86     >0.99     >0.99     0.5         0.004    0.04      0.08      250
Tornadoes     0.96     >0.99     >0.99     0.3         0.006    0.06      0.11      167

(ii) Relative probability where the chance of occurrence of
one event is stated in terms of another. This is a way
of comparing the effect of different types of adverse
events happening on a system or on a population when
the absolute probabilities are difficult to quantify. For
example, the relative risk for lung cancer is (approximately)
10 if a person has smoked before, compared to a nonsmoker.
This means that he is 10 times more likely to get lung
cancer than a nonsmoker. Table 2.10 shows leading causes
of death in the United States in the year 1992. Here the
observed values of the individual number of deaths due to
various causes are used to determine a relative risk
expressed as % in the last column. Thus, heart disease
which accounts for 33% of the total deaths is more than
16 times more likely than motor vehicle deaths. However,
as a note of caution, these are values aggregated across
the whole population, and need to be interpreted
accordingly. State and government analysts separate such
relative risks by age groups, gender and race for public
policy-making purposes.

Table 2.10 Leading causes of death in the United States, 1992.
(Adapted from Kolluru et al. 1996)

Cause                                    Annual deaths (x 1000)   Percent (%)
Cardiovascular or heart disease          720                      33
Cancer (malignant neoplasms)             521                      24
Cerebrovascular diseases (strokes)       144                      7
Pulmonary disease (bronchitis, asthma..) 91                       4
Pneumonia and influenza                  76                       3
Diabetes mellitus                        50                       2
Nonmotor vehicle accidents               48                       2
Motor vehicle accidents                  42                       2
HIV/AIDS                                 34                       1.6
Suicides                                 30                       1.4
Homicides                                27                       1.2
All other causes                         394                      18
Total annual deaths (rounded)            2,177                    100

(iii) Subjective probability, which differs from one person
to another, is an informed or best guess about an event
which can change as our knowledge of the event increases.
Subjective probabilities are those where the objective
view of probability has been modified to treat two types
of events: (i) when the occurrence is unique and is
unlikely to repeat itself, or (ii) when an event has
occurred but one is unsure of the final outcome. In such
cases, one still has to assign some measure of likelihood
of the event occurring, and use this in the analysis.
Thus, a subjective interpretation is adopted with the
probability representing a degree of belief of the outcome
selected as having actually occurred. There are no
"correct answers", simply a measure reflective of our
subjective judgment. A good example of such subjective
probability is one involving forecasting the probability
of whether the impacts on gross world product of a 3°C
global climate change by 2090 would be large or not. A
survey was conducted involving twenty leading researchers
working on global warming issues but with different
technical backgrounds, such as scientists, engineers,
economists, ecologists, and politicians, who were asked
to assign a probability estimate (along with 10% and 90%
confidence intervals). Though this was not a scientific
study as such since the whole area of expert opinion
elicitation is still not fully mature, there was
nevertheless a protocol in how the questioning was
performed, which led to the results shown in Fig. 2.31.
The median, 10% and 90% confidence intervals predicted by
different respondents show great scatter, with the
ecologists estimating impacts to be 20-30 times higher
(the two right most bars in the figure), while the
economists on average predicted a loss of only 0.4% in
gross world product. An engineer or a scientist may be
uncomfortable with such subjective probabilities, but
there are certain types of problems where this is the
best one can do with current knowledge. Thus, formal
analysis methods have to accommodate such information,
and it is here that Bayesian techniques can play a key
role.

Fig. 2.31 Example illustrating large differences in subjective
probability. A group of prominent economists, ecologists and natural
scientists were polled so as to get their estimates of the loss of gross
world product due to doubling of atmospheric carbon dioxide (which is
likely to occur by the end of the twenty-first century when mean global
temperatures increase by 3°C). The two ecologists predicted the highest
adverse impact while the lowest four individuals were economists. (From
Nordhaus 1994)

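The exposure probabilities and return times listed in Table 2.9 can be related to each other under a simple model. The sketch below assumes that events occur as a Poisson process with a constant rate equal to the reciprocal of the return time, so that the probability of at least one event during an exposure time T is 1 - exp(-T / return time). This model is an assumption made here for illustration; the source does not state how the table entries were computed. The function name `exposure_probability` is likewise hypothetical.

```python
from math import exp


def exposure_probability(return_time_yrs: float, exposure_yrs: float) -> float:
    """P(at least one event in the exposure time), assuming a Poisson process."""
    return 1.0 - exp(-exposure_yrs / return_time_yrs)


# Return times (years) for events with 10 fatalities, taken from Table 2.9.
return_times = {"Earthquakes": 9, "Hurricanes": 2, "Floods": 0.5, "Tornadoes": 0.3}

for disaster, return_time in return_times.items():
    probs = [exposure_probability(return_time, t) for t in (1, 10, 20)]
    print(disaster, [f"{p:.2f}" for p in probs])
```

For earthquakes, for instance, this gives about 0.11, 0.67 and 0.89 for exposure times of 1, 10 and 20 years, which matches the tabulated values to two decimals.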


Problems 

Pr. 2.1 An experiment consists of tossing two dice. 

(a) List all events in the sample space 

(b) What is the probability that both outcomes will have the 
same number showing up both times? 

(c) What is the probability that the sum of both numbers 
equals 10? 

Pr. 2.2 Expand Eq. 2.9 valid for two outcomes to three
outcomes: p(A ∪ B ∪ C) = ....

Pr. 2.3 A solar company has an inspection system for batches
of photovoltaic (PV) modules purchased from different
vendors. A batch typically contains 20 modules, while the
inspection system involves taking a random sample of 5
modules and testing all of them. Suppose there are 2 faulty
modules in the batch of 20.

(a) What is the probability that for a given sample, there 
will be one faulty module? 

(b) What is the probability that both faulty modules will be 
discovered by inspection? 

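Sampling without replacement of the kind described in Pr. 2.3 is usually treated with the hypergeometric distribution. The sketch below is offered only as a way of checking a hand calculation; it assumes every sample of 5 modules is equally likely, and the function name `hypergeom_pmf` is a hypothetical helper.

```python
from math import comb


def hypergeom_pmf(k: int, population: int, faulty: int, sample: int) -> float:
    """P(exactly k faulty items in a sample drawn without replacement)."""
    return comb(faulty, k) * comb(population - faulty, sample - k) / comb(population, sample)


# Batch of 20 modules, 2 of them faulty, random sample of 5.
for k in range(3):
    print(f"P({k} faulty in sample) = {hypergeom_pmf(k, 20, 2, 5):.4f}")
```

The three probabilities must add up to one, which is a useful sanity check on the counting argument.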
Pr. 2.4 A county office determined that of the 1000 homes 
in their area, 600 were older than 20 years (event A), that 
500 were constructed of wood (event B), and that 400 had 
central air conditioning (AC) (event C). Further, it is found 
that events A and B occur in 300 homes, that all three events 
occur in 150 homes, and that no event occurs in 225 homes. 



If a single house is picked, determine the following proba- 
bilities: 

(a) that it is older than 20 years and has central AC? 

(b) that it is older than 20 years and does not have central 
AC? 

(c) that it is older than 20 years and is not made of wood? 

(d) that it has central air and is made of wood? 

Pr. 2.5 A university researcher has submitted three research
proposals to three different agencies. Let E_1, E_2 and
E_3 be the events that the first, second and third bids are
successful with probabilities: p(E_1) = 0.15, p(E_2) = 0.20,
p(E_3) = 0.10. Assuming independence, find the following
probabilities:

(a) that all three bids are successful 

(b) that at least two bids are successful 

(c) that at least one bid is successful 

Pr. 2.6 Consider two electronic components A and B with
probability rates of failure of p(A) = 0.1 and p(B) = 0.25. What
is the failure probability of a system which involves connecting
the two components in (a) series and (b) parallel?

Pr. 2.7* A particular automatic sprinkler system for a high-rise
apartment has two different types of activation devices
for each sprinkler head. Reliability of such devices is a
measure of the probability of success, i.e., that the device
will activate when called upon to do so. Type A and Type B
devices have reliability values of 0.90 and 0.85 respectively.
In case a fire does start, calculate:

(a) the probability that the sprinkler head will be activated 
(i.e., at least one of the devices works), 

(b) the probability that the sprinkler will not be activated at 
all, and 

(c) the probability that both activation devices will work 
properly. 

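Problems of this type reduce to two elementary combinations of independent components. The sketch below assumes that the devices fail independently of each other, which the problem statement does not say explicitly; the function names are hypothetical helpers introduced for illustration.

```python
from math import prod


def series_reliability(reliabilities: list[float]) -> float:
    """All components must work: product of the individual reliabilities."""
    return prod(reliabilities)


def parallel_reliability(reliabilities: list[float]) -> float:
    """At least one component must work: one minus the product of failure probabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


devices = [0.90, 0.85]  # Type A and Type B reliabilities from Pr. 2.7
print(f"at least one device works: {parallel_reliability(devices):.3f}")
print(f"no device works:           {1.0 - parallel_reliability(devices):.3f}")
print(f"both devices work:         {series_reliability(devices):.3f}")
```

The same two functions can be nested to evaluate larger arrangements, provided the independence assumption is acceptable for the system at hand.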
Pr. 2.8 Consider the two system schematics shown in 
Fig. 2.32. At least one pump must operate when one chiller is 
operational, and both pumps must operate when both chillers 
are on. Assume that both chillers have identical reliabilities 
of 0.90 and that both pumps have identical reliabilities of 
0.95. 

(a) Without any computation, make an educated guess as to 
which system would be more reliable overall when (i) 
one chiller operates, and (ii) when both chillers operate. 

(b) Compute the overall system reliability for each of the 
configurations separately under cases (i) and (ii) defined 
above. 



* From McClave and Benson (1988) by permission of Pearson Educa- 
tion. 
Fig. 2.32 Two possible system configurations



Pr. 2.9 Consider the following CDF:

$F(x) = 1 - \exp(-2x)$ for $x > 0$
$F(x) = 0$ for $x < 0$

(a) Construct and plot the cumulative distribution function

(b) What is the probability of x < 2

(c) What is the probability of 3 < x < 5

Pr. 2.10 The joint density for the random variables (X,Y) is
given by:

$f(x,y) = 10xy^2$ for $0 < x < y < 1$
$f(x,y) = 0$ elsewhere

(a) Verify that Eq. 2.19 applies

(b) Find the marginal distributions of X and Y

(c) Compute the probability of 0 < x < 1/2, 1/4 < y < 1/2

Pr. 2.11* Let X be the number of times a certain numerical
control machine will malfunction on any given day. Let Y be
the number of times a technician is called on an emergency
call. Their joint probability distribution is:

f(x,y)     X = 1    X = 2    X = 3
Y = 1      0.05     0.05     0.1
Y = 2      0.05     0.1      0.35
Y = 3      0        0.2      0.1

(a) Determine the marginal distributions of X and of Y

(b) Determine the probability p(x < 2, y > 2)

* From Walpole et al. (2007) by permission of Pearson Education.

Pr. 2.12 Consider the data given in Example 2.2.6 for the
case of a residential air conditioner. You will use the same
data to calculate the flip problem using Bayes' law.

(a) During a certain day, it was found that the air-conditioner
was operating satisfactorily. Calculate the probability
that this was a "NH = not hot" day.

(b) Draw the reverse tree diagram for this case.

Pr. 2.13 Consider a medical test for a disease. The test has
a probability of 0.95 of correctly or positively detecting an
infected person (this is the sensitivity), while it has a
probability of 0.90 of correctly identifying a healthy person
(this is called the specificity). In the population, only 3%
of the people have the disease.

(a) What is the probability that a person testing positive is
actually infected?

(b) What is the probability that a person testing negative is
actually infected?

Pr. 2.14 A large industrial firm purchases several new
computers at the end of each year, the number depending on the
frequency of repairs in the previous year. Suppose that the
number of computers X purchased each year has the following
PDF:

X
f(x)     0.2     0.3     0.2     0.1

If the cost of the desired model will remain fixed at $1500
throughout this year and a discount of $50x^2 is credited
towards any purchase, how much can this firm expect to spend
on new computers at the end of this year?

Pr. 2.15 Suppose that the probabilities of the number of
power failures in a certain locality are given as:

X        0       1       2       3
f(x)     0.4     0.3     0.2     0.1

Find the mean and variance of the random variable X.

Pr. 2.16 An electric firm manufactures a 100 W light bulb,
which is supposed to have a mean life of 900 h and a standard
deviation of 50 h. Assume that the distribution is symmetric
about the mean. Determine what percentage of the bulbs fails
to last even 700 h if the distribution is found to follow: (i) a
normal distribution, (ii) a lognormal, (iii) a Poisson, and (iv)
a uniform distribution.

Pr. 2.17 Sulfur dioxide concentrations in air samples taken
in a certain region have been found to be well represented by
a lognormal distribution with parameters μ = 2.1 and σ = 0.8.

(a) What proportion of air samples have concentrations
between 3 and 6?

(b) What proportion do not exceed 10?

(c) What interval (a,b) is such that 95% of all air samples
have concentration values in this interval, with 2.5% having
values less than a, and 2.5% having values exceeding b?

Pr. 2.18 The average rate of water usage (in thousands of
gallons per hour) by a certain community can be modeled by
a lognormal distribution with parameters μ = 4 and σ = 1.5.
What is the probability that the demand will:

(a) be 40,000 gallons of water per hour

(b) exceed 60,000 gallons of water per hour

Pr. 2.19 Suppose the number of drivers who travel between
two locations during a designated time period is a Poisson
distribution with parameter λ = 30. In the long run, in what
proportion of time periods will the number of drivers:

(a) Be at most 20?

(b) Exceed 25?

(c) Be between 10 and 20?

Pr. 2.20 Suppose the time, in hours, taken to repair a home
furnace can be modeled as a gamma distribution with
parameters α = 2 and λ = 1/2. What is the probability that the
next service call will require:

(a) at most 1 h to repair the furnace?

(b) at least 2 h to repair the furnace?

Pr. 2.21* In a certain city, the daily consumption of electric
power, in millions of kilowatt-hours (kWh), is a random
variable X having a gamma distribution with mean = 6 and
variance = 12.

(a) Find the values of the parameters α and λ

(b) Find the probability that on any given day the daily
power consumption will exceed 12 million kWh.

Pr. 2.22 The life in years of a certain type of electrical
switches has an exponential distribution with an average life in
years of λ = 5. If 100 of these switches are installed in
different systems,

(a) what is the probability that at most 20 will fail during
the first year?

(b) How many are likely to have failed at the end of 3 years?

Pr. 2.23 Probability models for global horizontal solar
radiation.

Probability models for predicting solar radiation at the
surface of the earth were the subject of several studies in the
last several decades. Consider the daily values of global (beam
plus diffuse) radiation on a horizontal surface at a specified
location. Because of the variability of the atmospheric
conditions at any given location, this quantity can be viewed as
a random variable. Further, global radiation has an underlying
annual pattern due to the orbital rotation of the earth
around the sun. A widely adopted technique to filter out this
deterministic trend is:

(i) to select the random variable not as the daily radiation
itself but as the daily clearness index K defined as the
ratio of the daily global radiation on the earth's surface
for the location in question to that outside the atmosphere
for the same latitude and day of the year, and
(ii) to truncate the year into 12 monthly time scales since
the random variable K for a location changes appreciably
on a seasonal basis.

Gordon and Reddy (1988) proposed an expression for
the PDF of the random variable $X = (K / \bar{K})$ where $\bar{K}$ is
the monthly mean value of the daily values of K during a
month. The following empirical model was shown to be of
general validity in that it applied to several locations
(temperate and tropical) and months of the year with the same
variance in K:

$p(X; A, n) = A X^n \left[ 1 - (X / X_{max}) \right]$ (2.55)

where A, n and $X_{max}$ are model parameters which have been
determined from the normalization of p(X), knowledge of
$\bar{K}$ (i.e., $\bar{X} = 1$) and knowledge of the variance of X or $\sigma^2(X)$.
Derive the following expressions for the three model
parameters:

$n = -2.5 + 0.5 \left[ 9 + (8 / \sigma^2(X)) \right]^{1/2}$

$X_{max} = (n + 3) / (n + 1)$

$A = (n + 1)(n + 2) / X_{max}^{n+1}$

Note that because of the manner of normalization, the random
variable selected can assume values greater than unity.
Figure 2.33 shows the proposed distribution for a number of
different variance values.

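The three parameter expressions of Pr. 2.23 can be checked numerically before attempting the derivation. The sketch below evaluates them for one variance value (0.04, one of the values plotted in Fig. 2.33) and then integrates Eq. 2.55 with a simple midpoint rule to confirm that the area, mean and variance come out as intended. It is a numerical check under these assumptions, not the derivation asked for; the function names and the step count are hypothetical choices made here.

```python
from math import sqrt


def gordon_reddy_parameters(variance: float) -> tuple[float, float, float]:
    """Return (A, n, X_max) of Eq. 2.55 from the variance of X."""
    n = -2.5 + 0.5 * sqrt(9.0 + 8.0 / variance)
    x_max = (n + 3.0) / (n + 1.0)
    a = (n + 1.0) * (n + 2.0) / x_max ** (n + 1.0)
    return a, n, x_max


def pdf(x: float, a: float, n: float, x_max: float) -> float:
    """Eq. 2.55: p(X) = A X^n [1 - X / X_max]."""
    return a * x**n * (1.0 - x / x_max)


def moment(order: int, a: float, n: float, x_max: float, steps: int = 20_000) -> float:
    """Midpoint-rule estimate of the integral of x**order * p(x) over [0, X_max]."""
    h = x_max / steps
    return h * sum(((i + 0.5) * h) ** order * pdf((i + 0.5) * h, a, n, x_max) for i in range(steps))


a, n, x_max = gordon_reddy_parameters(0.04)
area = moment(0, a, n, x_max)
mean = moment(1, a, n, x_max)
var = moment(2, a, n, x_max) - mean**2
print(f"n = {n:.4f}, X_max = {x_max:.4f}, A = {a:.4f}")
print(f"area = {area:.6f}, mean = {mean:.6f}, variance = {var:.6f}")
```

With these inputs the numerical area and mean are both close to one and the variance is close to the prescribed 0.04, which is what the three conditions (normalization, unit mean, known variance) require.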


* From Walpole et al. (2007) by permission of Pearson Education.

[Plot: PDF versus random variable X, with curves for var(X) = 0.02, 0.04, 0.06 and 0.15]

Fig. 2.33 Probability distributions of solar radiation given by Eq. 2.55
as proposed by Gordon and Reddy (1988)
Pr. 2.24 Cumulative distribution and utilizability functions for
horizontal solar radiation

Use the equations described below to derive the CDF and
the utilizability functions for the Gordon-Reddy probability
distribution function described in Pr. 2.23.

Probability distribution functions for solar radiation (as
in Pr. 2.23 above) and also for ambient air temperatures are
useful to respectively predict the long-term behavior of solar
collectors and the monthly/annual heating energy use of
small buildings. For example, the annual/monthly space
heating load $Q_{load}$ (in MJ) is given by:

$Q_{load} = (UA)_{Bldg} \cdot DD \cdot \left( 86{,}400 \, \frac{s}{day} \right) \cdot \left( 10^{-6} \, \frac{MJ}{J} \right)$

where $(UA)_{Bldg}$ is the building overall energy loss/gain per
unit temperature difference in W/°C; and DD is the degree-day
given by:

$DD = \sum_{d=1}^{N} (18.3 - \bar{T}_d)^{+}$ in °C-day

where $\bar{T}_d$ is the daily average ambient air temperature and
N = 365 if annual time scales are considered. The "+" sign
indicates that only positive values within the brackets should
contribute to the sum, while negative values should be set to
zero. Physically this implies that only when the ambient air
is lower than the design indoor temperature, which is
historically taken as 18.3°C, would there be a need for heating
the building. It is clear that the DD is the sum of the differences
between each day's mean temperature and the design temperature
of the conditioned space. It can be derived from knowledge
of the PDF of the daily ambient temperature values at
the location in question.

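The degree-day sum and the heating load expression above translate directly into code. The sketch below is a minimal illustration: the five daily temperatures and the UA value of 200 W/°C are hypothetical numbers chosen only to show the mechanics, and the function names are introduced here for convenience.

```python
def degree_days(daily_mean_temps_c: list[float], base_temp_c: float = 18.3) -> float:
    """Sum of the positive parts of (base - daily mean temperature), in °C-day."""
    return sum(max(base_temp_c - t, 0.0) for t in daily_mean_temps_c)


def heating_load_mj(ua_w_per_c: float, dd_c_day: float) -> float:
    """Q_load = UA * DD * (86,400 s/day) * (1e-6 MJ/J), in MJ."""
    return ua_w_per_c * dd_c_day * 86_400 * 1e-6


# Hypothetical daily mean ambient temperatures (°C); warm days contribute zero.
temps = [10.0, 15.3, 18.3, 22.0, 5.3]
dd = degree_days(temps)
print(f"DD = {dd:.1f} °C-day")
print(f"Q_load = {heating_load_mj(200.0, dd):.1f} MJ for UA = 200 W/°C")
```

Days at or above 18.3°C add nothing to the sum, which is exactly the role of the "+" superscript in the degree-day definition.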
A similar approach has also been developed for predicting
the long-term energy collected by a solar collector either at
the monthly or annual time scale involving the concept of
solar utilizability (Reddy 1987). Consider Fig. 2.34a which
shows the PDF function P(X) for the normalized variable X
described in Pr. 2.23, and bounded by $X_{min}$ and $X_{max}$. The area
of the shaded portion corresponds to a specific value X' of
the CDF (see Fig. 2.34b):

$F(X') = \text{probability}(X \le X') = \int_{X_{min}}^{X'} P(X) \, dX$ (2.56a)

The long-term energy delivered by a solar thermal collector
is proportional to the amount of solar energy above a certain
critical threshold $X_c$, and this is depicted as a shaded area in
Fig. 2.34b. This area is called the solar utilizability, and is
functionally described by:

$\phi(X_c) = \int_{F_c}^{1} (X' - X_c) \, dF = \int_{X_c}^{X_{max}} [1 - F(X')] \, dX'$ (2.56b)

The value of the utilizability function for such a critical
radiation level $X_c$ is shown in Fig. 2.34c.

Fig. 2.34 Relation between different distributions. a Probability
density curve (shaded area represents the cumulative distribution value
F(X')). b Cumulative distribution function (shaded area represents
utilizability fraction at $X_c$). c Utilizability curve. (From Reddy 1987)

Pr. 2.25 Generating cumulative distribution curves and
utilizability curves from measured data.

The previous two problems involved probability distributions
of solar radiation and ambient temperature, and how
these could be used to derive functions for quantities of
interest such as the solar utilizability or the Degree-Days. If
monitored data is available, there is no need to delve into
such considerations of probability distributions, and one can
calculate these functions numerically.

Consider Table P2.25 (in Appendix B) which assembles
the global solar radiation on a horizontal surface at Quezon
City, Manila during October 1980 (taken from Reddy 1987).
You are asked to numerically generate the CDF and the
utilizability functions (Eq. 2.56a, b) and compare your results to
Fig. 2.35. The symbols I and H denote hourly and daily
radiation values respectively. Note that instead of normalizing
the radiation values by the extra-terrestrial solar radiation (as
done in Pr. 2.23), here the corresponding average values $\bar{I}$
(for individual hours of the day) and $\bar{H}$ have been used.

Fig. 2.35 Distribution for Quezon City, Manila during October 1980. (From Reddy 1987)

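When measured data are available, both functions in Eq. 2.56 reduce to simple averages over the sample. The sketch below normalizes a small set of daily radiation values by their mean and evaluates the empirical CDF and the utilizability at a few critical levels. The radiation values are hypothetical placeholders, not the entries of Table P2.25, and the function names are introduced here for illustration.

```python
def empirical_cdf(values: list[float], x: float) -> float:
    """Fraction of observations less than or equal to x (Eq. 2.56a, from data)."""
    return sum(1 for v in values if v <= x) / len(values)


def utilizability(values: list[float], x_c: float) -> float:
    """Mean of the positive part of (X - x_c), i.e. the integral of (X' - x_c) dF."""
    return sum(max(v - x_c, 0.0) for v in values) / len(values)


# Hypothetical daily global radiation values (MJ/m2-day), for illustration only.
radiation = [12.0, 18.5, 9.3, 21.1, 15.6, 19.9, 7.4, 16.2]
mean_radiation = sum(radiation) / len(radiation)
x = [r / mean_radiation for r in radiation]  # normalized by the sample mean

for x_c in (0.0, 0.5, 1.0, 1.5):
    print(f"X_c = {x_c:.1f}: F = {empirical_cdf(x, x_c):.3f}, phi = {utilizability(x, x_c):.3f}")
```

Because the data are normalized by their own mean, the utilizability equals one at a critical level of zero and falls to zero at the largest observed value, which gives two quick checks on any numerically generated curve.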


References 

Ang, A.H.S. and W.H. Tang, 2007. Probability Concepts in Engineering,
2nd Ed., John Wiley and Sons, USA.

Barton, C. and S. Nishenko, 2008. Natural Disasters: Forecasting
Economic and Life Losses, U.S. Geological Survey, Marine and Coastal
Geology Program.

Bolstad, W.M., 2004. Introduction to Bayesian Statistics, Wiley and
Sons, Hoboken, NJ.

Gordon, J.M. and T.A. Reddy, 1988. Time series analysis of daily
horizontal solar radiation, Solar Energy, 41(3), pp. 215-226.

Kolluru, R.V., S.M. Bartell, R.M. Pitblado, and R.S. Stricoff,
1996. Risk Assessment and Management Handbook, McGraw-Hill,
New York.

McClave, J.T. and P.G. Benson, 1988. Statistics for Business and
Economics, 4th Ed., Dellen and Macmillan, London.

Nordhaus, W.D., 1994. Expert opinion on climate change, American
Scientist, 82: 45-51.

Phillips, L.D., 1973. Bayesian Statistics for Social Scientists, Thomas
Nelson and Sons, London, UK.

Reddy, T.A., 1987. The Design and Sizing of Active Solar Thermal
Systems, Oxford University Press, Oxford, U.K.

Walpole, R.E., R.H. Myers, S.L. Myers, and K. Ye, 2007. Probability
and Statistics for Engineers and Scientists, 8th Ed., Prentice Hall,
Upper Saddle River, NJ.



Data Collection and Preliminary Data 
Analysis 
This chapter starts by presenting some basic notions and
characteristics of different types of data collection systems 
and types of sensors. Next, simple ways of validating and 
assessing the accuracy of the data collected are addressed. 
Subsequently, salient statistical measures to describe univari- 
ate and multivariate data are presented along with how to use 
them during basic exploratory data and graphical analyses. 
The two types of measurement uncertainty (bias and random) 
are discussed and the concept of confidence intervals is intro- 
duced and its usefulness illustrated. Finally, three different 
ways of determining uncertainty in a data reduction equation 
by propagating individual variable uncertainty are presented; 
namely, the analytical, numerical and Monte Carlo methods. 



3.1 Generalized Measurement System 

There are several books, for example Doebelin (1995) or Hol- 
man and Gajda (1984) which provide a very good overview 
of the general principles of measurement devices and also 
address the details of specific measuring devices (the com- 
mon ones being those that measure physical quantities such as 
temperature, fluid flow, heat flux, velocity, force, torque, pres- 
sure, voltage, current, power,...). This section will be limited 
to presenting only those general principles which would aid 
the analyst in better analyzing his data. The generalized mea- 
surement system can be divided into three parts (Fig. 3.1): 
(i) detector-transducer stage, which detects the value of 
the physical quantity to be measured and transduces or 
transforms it into another form, i.e., performs either a 
mechanical or an electrical transformation to convert 
the signal into a more easily measured and usable form 
(either digital or analog); 
(ii) intermediate stage, which modifies the direct signal by 
amplification, filtering, or other means so that an output 
within a desirable range is achieved. If there is a known 
correction (or calibration) for the sensor, it is done at 
this stage; and 



(iii) output or terminating stage, which acts to indicate, 
record, or control the variable being measured. The out- 
put could be digital or analog. 
Ideally, the output or terminating stage should indicate 
only the quantity to be measured. Unfortunately, there are 
various spurious inputs which could contaminate the desired 
measurement and introduce errors. Doebelin (1995) groups 
the inputs that may cause contamination into two basic types: 
modifying and interfering (Fig. 3.2). 

(i) Interfering inputs introduce an error component to the 
output of the detector-transducer stage in a rather direct 
manner, just as does the desired input quantity. For 
example, if the quantity being measured is temperature 
of a solar collector's absorber plate, improper shield- 
ing of the thermocouple would result in an erroneous 
reading due to the solar radiation striking the thermo- 
couple junction. As shown in Fig. 3.3, the calibration of 
the instrument is no longer a constant but is affected by 
the time at which the measurement is made, and since 
this may differ from one day to the next, the net result 
is improper calibration. Thus, solar radiation would be 
thought of as an interfering input, 
(ii) Modifying inputs have a more subtle effect, introducing 
errors by modifying the input/output relation between 
the desired value and the output measurement (Fig. 3.2). 
An example of this occurrence is when the oil used to 
lubricate the various intermeshing mechanisms of a sys- 
tem has deteriorated, and the resulting change in viscos- 
ity can lead to the input/output relation getting altered 
in some manner. 
One needs also to distinguish between the analog and the
digital nature of the sensor output or signal (Doebelin 1995).
For analog signals, the precise value of the quantity (voltage,
temperature,...) carrying the information is important. Digital
signals, however, are basically binary in nature (on/off),
and variation in numerical values is associated with changes
in the logical state (true/false). Consider a digital electronic
system where any voltage in the range of +2 to +5 V produces
the on-state, while signals of 0 to +1.0 V correspond to the
off-state. Thus, whether the voltage is 3 or 4 V has the same
result. Consequently, such digital systems are quite tolerant to
spurious noise effects that contaminate the information signal.
However, many primary sensing elements and control apparatus
are analog in nature while the widespread use of computers
has led to data reduction and storage being digital.

Fig. 3.1 Schematic of the generalized measurement system

Fig. 3.2 Different types of inputs and noise in a measurement
system

Fig. 3.3 Effect of uncontrolled interfering input on calibration

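The noise tolerance of a digital signal can be illustrated with the voltage bands quoted above. The sketch below maps a voltage to a logical state using the +2 to +5 V and 0 to +1.0 V ranges from the text; the treatment of voltages outside both bands as "undefined" and the function name `logic_state` are assumptions made here for illustration.

```python
def logic_state(voltage: float) -> str:
    """Map a voltage to the logical state of the digital system described in the text."""
    if 2.0 <= voltage <= 5.0:
        return "on"
    if 0.0 <= voltage <= 1.0:
        return "off"
    return "undefined"  # assumption: the text does not say what happens between the bands


# A nominal 3.5 V signal contaminated by +/- 0.5 V of noise stays in the on-state.
for v in (3.0, 3.5, 4.0):
    print(f"{v:.1f} V -> {logic_state(v)}")
```

An analog reading of the same signal would carry the full 0.5 V error, whereas the logical state is unchanged as long as the noise keeps the voltage inside its band.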


3.2 Performance Characteristics of Sensors 
and Sensing Systems 

There are several terms that are frequently used in connec- 
tion with sensors and data recording systems. These have 
to do with their performance characteristics, both static and 
dynamic, and these will be briefly discussed below. 



3.2.1 Sensors 

(a) Accuracy is the ability of an instrument to indicate 
the true value of the measured quantity. As shown in 
Fig. 3.4a, the accuracy of an instrument indicates 
the deviation between one, or an average of several, 
reading(s) from a known input or accepted reference 
value. The spread in the target holes in the figure is 
attributed to random effects. 

(b) Precision is the closeness of agreement among repeated 
measurements of the same physical quantity. The preci- 
sion of an instrument indicates its ability to reproduce a 
certain reading with a given accuracy. Figure 3.4b illustrates
marksmanship that is precise but inaccurate because of a
bias.

(c) Span (also called dynamic range) of an instrument is 
the range of variation (minimum to maximum) of the 
physical quantity which the instrument can measure. 

(d) Resolution or least count is the smallest incremental 
value of a measured quantity that can be reliably mea- 
sured and reported by an instrument. Typically, this is 
half the smallest scale division of an analog instrument, 
or the least significant bit of an analog to digital system. 
In case of instruments with non-uniform scale, the reso- 
lution will vary with the magnitude of the output signal 
being measured. When resolution is measured at the 
origin of the calibration curve, it is called the threshold 
of the instrument (see Fig. 3.5). Thus, the threshold is 
the smallest detectable value of the measured quantity 
while resolution is the smallest perceptible change over 
its operable range.

Fig. 3.4 Concepts of accuracy and precision illustrated in terms of shooting at a target

Fig. 3.5 Concepts of threshold and resolution
[Figure labels: "At off-design temperature", "At nominal design temperature", "Total error due to temperature", "Threshold"]

(e) Sensitivity of an instrument is the ratio of the linear
movement of the pointer on an analog instrument to the
change in the measured variable causing this motion.
For example, a 1 mV recorder with a 25 cm scale-length
would have a sensitivity of 25 cm/mV if the measurements
were linear over the scale. It is thus representative
of the slope of the input/output curve if assumed
linear. All things being equal, instruments with larger
sensitivity are preferable, but a larger sensitivity generally
means a smaller range for the instrument.
Figure 3.6 shows a linear relationship between the output
and the input. Spurious inputs due to the modifying and
interfering inputs can cause a zero drift and a sensitivity
drift from the nominal design curve. Some "smart"
transducers have built-in corrections for such effects,
which can be applied on a continuous basis. Note, finally,
that sensitivity should not be confused with accuracy,
which is an entirely different characteristic.

(f) Hysteresis (also called dead space or dead band) is the
difference in readings depending on whether the value
of the measured quantity is approached from above or
below (see Fig. 3.7). This is often the result of mechanical
friction, magnetic effects, elastic deformation, or
thermal effects. Another cause could be that the experimenter
does not allow enough time between measurements
to reestablish steady-state conditions. The band
can vary over the range of variation of the variables, as
shown in the figure.

Fig. 3.6 Zero drift and sensitivity drift (axis label: Input signal)


Fig. 3.7 Illustrative plot of a hysteresis band of a sensor showing local and maximum values (axis label: Output signal)

(g) Calibration is the checking of the instrument output
against a known standard, and then correcting for bias.
The standard can be either a primary standard (say, at
the National Institute of Standards and Technology), or
a secondary standard with a higher accuracy than the
instrument to be calibrated, or a known input source
(say, checking a flowmeter against direct weighing of
the fluid). Doebelin (1995) suggests that, as a rule of
thumb, the primary standard should be about 10 times
more accurate than the instrument being calibrated.
Figure 3.8 gives a table and a graph of the results of
calibrating a pressure measuring device. The data
points denoted by circles were obtained during the
calibration process when the pressure was
incrementally increased, while the data points denoted
by triangles were obtained when the pressure was
incrementally decreased. The difference between the two
sets of points is due to the hysteresis of
the instrument. Further, the instrument value may differ
from the "true" value by a bias (or systematic error) and
an uncertainty (or random error), as shown in Fig. 3.8.
A linear relationship is fit to the data points to yield
the calibration curve. Note that the calibration curve need not
necessarily be linear, though in practice instruments
are designed to be linear because of the
associated convenience of usage and interpretation.
When a calibration is completed, it is used to convert
an instrument reading of an unknown quantity into a
best estimate of the true value. Thus, the calibration
curve corrects for bias and puts numerical limits (say,
±2 standard deviations) on the random errors of the
observations.
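
As a minimal sketch of the static calibration described in item (g), the following Python fragment fits a straight line to calibration data by least squares and computes the standard deviation s of the residuals, from which ±2s limits on the random error follow. The pressure values are made-up illustrative numbers, not the data of Fig. 3.8, and the function names are hypothetical.

```python
import math

def fit_line(x, y):
    """Least-squares straight line y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residual_std(x, y, intercept, slope):
    """Standard deviation of the residuals about the line (n - 2 degrees of freedom)."""
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return math.sqrt(sum(r * r for r in residuals) / (len(x) - 2))

def corrected_value(reading, intercept, slope):
    """Invert the calibration line to estimate the true value from a reading."""
    return (reading - intercept) / slope

# Hypothetical calibration run: true pressure vs. instrument reading (kPa)
true_p = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
reading = [-0.9, 1.3, 3.4, 5.3, 7.5, 9.6]

b0, b1 = fit_line(true_p, reading)
s = residual_std(true_p, reading, b0, b1)
estimate = corrected_value(5.0, b0, b1)
print(f"reading = {b0:.3f} + {b1:.3f} * true, s = {s:.3f}")
print(f"a reading of 5.0 corresponds to {estimate:.2f} +/- {2 * s / b1:.2f} kPa")
```

Dividing 2s by the slope expresses the random-error limits in units of the measured quantity rather than of the instrument reading.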

The above terms basically describe the static response, 
i.e., when the physical quantity being measured does 
not change with time. Section 1.2.5 also introduced 
certain simple models for static and dynamic response 
of sensors. Usually the physical quantity will change 
with time, and so the dynamic response of the sensor 
or instrument has to be considered. In such cases, new 
ways of specifying accuracy are required. 

(h) Rise time is the delay in the sensor output response 
when the physical quantity being measured undergoes a 
step change (see Fig. 3.9). 

(i) Time constant of the sensor is defined as the time taken
for the sensor output to attain a value of 63.2% of the
difference between the final steady-state value and the
initial steady-state value when the physical quantity
being measured undergoes a step change. Though the
concept of time constant is strictly applicable to linear
systems only (see Sect. 1.2.5), the term is commonly
applied to all types of sensors and data recording systems.

(j) Distortion is a very general term that is used to describe
the deviation of the output signal of the sensor from
its true form, which is characterized by the variation of the
physical quantity being measured. Depending on the sensor,
the distortion could result either in poor frequency
response or poor phase-shift response (Fig. 3.10). For
pure electrical measurements, electronic devices are
used to keep distortion to a minimum.
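
To illustrate the 63.2% figure in item (i), here is a short sketch that assumes an ideal first-order sensor whose step response is y(t) = y0 + (yf - y0)(1 - exp(-t/tau)). The numerical values (a step from 20 to 80 with tau = 5 s, sampled every 0.5 s) and the function names are hypothetical.

```python
import math

def first_order_step(t, y0, yf, tau):
    """Output of an ideal first-order sensor at time t after a step from y0 to yf."""
    return y0 + (yf - y0) * (1.0 - math.exp(-t / tau))

def estimate_time_constant(times, outputs):
    """Estimate tau from sampled step-response data.

    Finds where the output crosses 63.2% of the total change, using linear
    interpolation between samples. Assumes a monotonic response that has
    essentially settled by the last sample.
    """
    y0, yf = outputs[0], outputs[-1]
    target = y0 + (1.0 - math.exp(-1.0)) * (yf - y0)
    for k in range(1, len(times)):
        lo, hi = outputs[k - 1], outputs[k]
        if hi != lo and (lo - target) * (hi - target) <= 0:
            frac = (target - lo) / (hi - lo)
            return times[k - 1] + frac * (times[k] - times[k - 1])
    raise ValueError("response never crosses the 63.2% level")

times = [0.5 * k for k in range(121)]          # 0 to 60 s
outputs = [first_order_step(t, 20.0, 80.0, 5.0) for t in times]
tau_est = estimate_time_constant(times, outputs)
print(f"estimated time constant: {tau_est:.2f} s")
```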



3.2.2 Types and Categories of Measurements 

Measurements are categorized as either primary measure- 
ments or derived measurements. 

(i) A primary measurement is one that is obtained directly 
from the measurement sensor. This can be temperature, 
pressure, velocity, etc. The basic criterion is that a primary
measurement is of a single item from a specific
measurement device.
(ii) A derived measurement is one that is calculated using 
one or more measurements. This calculation can occur 
at the sensor level (an energy meter uses flow and tem- 
perature difference to report an energy rate), by a data 
logger, or can occur during data processing. Derived 
measurements can use both primary and other derived 
measurements. 
Further, measurements can also be categorized by type: 
(i) Stationary data does not change with time. Examples of 
stationary data include the mass of water in a tank, the 
area of a room, the length of piping or the volume of a 
building. Therefore, whenever the measurement is replicated,
the result should be the same, independently of
time, within the bounds of measurement uncertainty.
(ii) Time dependent data varies with time. Examples of 
time dependent data include the pollutant concentration 
in a water stream, temperature of a space, the chilled 
water flow to a building, and the electrical power use of 
a facility. A time-dependent reading taken now could be 
different than a reading taken in the next five minutes, 
the next day, or the next year. Time dependent data can 
be recorded either as time-series or cross-sectional: 

- Time-series data consist of a multiplicity of data 
taken at a single point or location over fixed intervals 
of time, thus retaining the time sequence nature. 

- Cross-sectional data are data taken at single or mul- 
tiple points at a single instant in time with time not 
being a variable in the process.

Fig. 3.8 Static calibration to define bias and random variation or uncertainty. Note that s is the standard deviation of the differences between measurement and the least squares model. (From Doebelin (1995) by permission of McGraw-Hill)

Fig. 3.9 Concept of rise time of the output response to a step input



3.2.3 Data Recording Systems 

The above concepts also apply to data recording or log- 
ging systems, where, however, additional ones need also be 
introduced: 

(a) Recording interval is the time period at
which data are recorded (a typical range for
thermal systems could be 1-15 min)

(b) Scan rate is the frequency with which the recording
system samples individual measurements; the corresponding
sampling period is often much shorter than the recording
interval (with electronic loggers, a typical value could be
one sample per second)

(c) Scan interval is the minimum interval between sepa- 
rate scans of the complete set of measurements which 
includes several sensors (a typical value could be 
10-15 s) 

(d) Non-process data trigger. Care must be taken that averaging
of the physical quantities that are subsequently
recorded does not include non-process data (e.g., temperature
data when the flow in a pipe is stopped but
the sensor keeps recording the temperature of the fluid
at rest). Often data acquisition systems use a threshold
trigger to initiate acceptance of individual samples
in the final averaged value, or monitor the status of an
appropriate piece of equipment (for example, whether a
pump is operational or not).

Fig. 3.10 Effects of frequency response and phase-shift response on complex waveforms. (From Holman and Gajda (1984) by permission of McGraw-Hill)
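
The non-process data trigger of item (d) can be sketched as follows. This is only an illustration under simple assumptions: each scan carries a Boolean pump-status flag, and the recorded value is the average of the scans accepted within one recording interval. The function name and the sample values are hypothetical.

```python
def interval_average(samples, pump_on):
    """Average only the samples taken while the pump status flag is True.

    Returns None when no process data were collected in the interval.
    """
    accepted = [s for s, on in zip(samples, pump_on) if on]
    if not accepted:
        return None
    return sum(accepted) / len(accepted)

# Hypothetical scans of a pipe temperature within one recording interval
temps = [45.1, 45.3, 45.2, 30.0, 28.5, 45.0]
status = [True, True, True, False, False, True]
print(interval_average(temps, status))
```

Returning None rather than zero for an interval without process data avoids confusing "no data" with a genuine zero reading.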



3.3 Data Validation and Preparation

The aspects of data collection, cleaning, validation and transformation
are crucial. However, these aspects are summarily
treated in most books, partly because their treatment involves
adopting circumstance-specific methods, and also because it
is (alas) considered neither of much academic interest nor a
field worthy of scientific/statistical endeavor. This process is
not so much a bag of tricks, but rather a process of critical
assessment, exploration, testing and evaluation which comes
with some amount of experience.

Data reduction involves the distillation of raw data into a
form that can be subsequently analyzed. It may involve averaging
multiple measurements, quantifying necessary conditions
(e.g., steady state), comparing with physical limits or
expected ranges, and rejecting outlying measurements. Data
validation, or proofing data for consistency, is a process for
detecting and removing gross or "egregious" errors in the
monitored data. It is extremely important to do this proofing
or data quality checking at the very beginning, even before
any sort of data analysis is attempted. A few such data points
could completely overwhelm even the most sophisticated
analysis procedures one could adopt. Note that statistical
screening (discussed later) is more appropriate for detecting
outliers and not for detecting gross errors. There are several
types of data proofing, as described below (ASHRAE
2005).



3.3.1 Limit Checks

Fortunately, many of the measurements made in the context 
of engineering systems have identifiable limits. Limits are 
useful in a number of experimental phases such as establish- 
ing a basis for appropriate instrumentation and measurement 
techniques, rejecting individual experimental observations, 
and bounding/bracketing measurements. Measurements can 
often be compared with one or more of the following limits: 
physical, expected and theoretical.

(a) Physical Limits. Appropriate physical limits should be
identified in the planning phases of an experiment so
that they can be used to check the reasonableness of
both raw and post-processed data. Under no circumstance
can experimental observations exceed physical
limits. For example, in psychrometrics:

- dry bulb temperature > wet bulb temperature > dew
point temperature

- 0 < relative humidity < 100%

An example in refrigeration systems is that the refrigerant
saturated condensing temperature should always be greater
than the outdoor air dry bulb temperature for air-cooled condensers.
Another example in solar radiation measurement is that
global radiation on a surface should be greater than the
beam radiation incident on the same surface.
Experimental observations or processed data that 
exceed physical limits should be flagged and closely 
scrutinized to determine the cause and extent of their 
deviation from the limits. The reason for data being per- 
sistently beyond physical limits is usually instrumen- 
tation bias or errors in data analysis routines/methods. 
Data that infrequently exceed physical limits may be 
caused by noise or other related problems. Resolving 
problems associated with observations that sporadically 
exceed physical limits is often difficult. However, if 
they occur, the experimental equipment and data analy- 
sis routines should be inspected and repaired. In situa- 
tions where data occasionally exceed physical limits, it 
is often justifiable to purge such observations from the 
dataset prior to undertaking any statistical analysis or 
testing of hypotheses. 

(b) Expected Limits. In addition to identifying physical lim- 
its, expected upper and lower bounds should be identified 
for each measured variable. During the planning phase 
of an experiment, determining expected ranges for mea- 
sured variables facilitates the selection of instrumenta- 
tion and measurement techniques. Prior to taking data, it 
is important to ensure that the measurement instruments 
have been calibrated and are functional over the range 
of expected operation. An example is that the relative 
humidity in conditioned office spaces should be in the 
range of 30-65%. During the execution phase of
experiments, the identified bounds serve as the basis for 
flagging potentially suspicious data. If individual obser- 
vations exceed the upper or lower range of expected 
values, those points should be flagged and closely scru- 
tinized to establish their validity and reliability. Another 
suspicious behavior is constant values when varying val- 
ues are expected. Typically, this is caused by an incorrect 
lower or upper bound in the data reporting system so that 
limit values are being reported instead of actual values. 

(c) Theoretical Limits. These limits may be related to physical
properties of substances (e.g., fluid freezing point),
thermodynamic limits of a subsystem or system (e.g.,
Carnot efficiency for a vapor compression cycle), or thermodynamic
definitions (e.g., heat exchanger effectiveness
between zero and one). During the execution phase
of experiments, theoretical limits can be used to bound
measurements. If individual observations exceed theoretical
values, those points should be flagged and closely
scrutinized to establish their validity and reliability.
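
The limit checks above lend themselves to simple automated screening. The sketch below flags violations of the psychrometric physical limits and tests a value against an expected range. The function names are hypothetical, and one assumption departs slightly from the strict inequalities quoted in the text: equality is tolerated here so that saturated air is not flagged.

```python
def check_psychrometric_limits(dry_bulb, wet_bulb, dew_point, rel_humidity):
    """Return a list of violated physical limits (empty if none).

    Temperatures must share one unit; relative humidity is in percent.
    """
    flags = []
    if not (dry_bulb >= wet_bulb >= dew_point):
        flags.append("temperature ordering violated")
    if not (0.0 <= rel_humidity <= 100.0):
        flags.append("relative humidity outside 0-100%")
    return flags

def within_expected_range(value, lower, upper):
    """True when a measurement lies inside its expected bounds."""
    return lower <= value <= upper

# Hypothetical observations
print(check_psychrometric_limits(25.0, 18.0, 14.0, 50.0))   # no flags
print(check_psychrometric_limits(20.0, 22.0, 14.0, 105.0))  # two flags
print(within_expected_range(70.0, 30.0, 65.0))              # office RH check
```

Flagged points are candidates for scrutiny rather than automatic deletion, in line with the guidance above.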



3.3.2 Independent Checks Involving Mass 
and Energy Balances 

In a number of cases, independent checks can be used to 
establish viability of data once the limit checks have been 
performed. Examples of independent checks include com- 
parison of measured (or calculated) values with those of 
other investigators (reported in the published literature) and 
intra-experiment comparisons (based on component conser- 
vation principles) which involve collecting data and applying 
appropriate conservation principles as part of the validation 
procedures. The most commonly applied conservation prin- 
ciples used for independent checks include mass and energy 
balances on components, subsystems and systems. All inde- 
pendent checks should agree to within the range of expected 
uncertainty of the quantities being compared. An example 
of heat balance conservation check as applied to vapor com- 
pression chillers is that the chiller cooling capacity and the 
compressor power should add up to the heat being rejected 
at the condenser. 
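
A minimal sketch of the chiller heat balance check follows. It assumes all three quantities are expressed in the same power units and that the analyst supplies a tolerance reflecting the expected measurement uncertainty. The function names and numerical values are hypothetical.

```python
def heat_balance_residual(q_evaporator, w_compressor, q_condenser):
    """Relative imbalance: (cooling capacity + compressor power - heat rejected) / heat rejected."""
    return (q_evaporator + w_compressor - q_condenser) / q_condenser

def balance_ok(q_evaporator, w_compressor, q_condenser, tolerance):
    """True when the imbalance is within the stated fractional tolerance."""
    return abs(heat_balance_residual(q_evaporator, w_compressor, q_condenser)) <= tolerance

# Hypothetical readings in kW with a 5% tolerance
print(heat_balance_residual(350.0, 70.0, 428.0))
print(balance_ok(350.0, 70.0, 428.0, 0.05))
```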

Another sound practice is to design some amount of
redundancy into the experimental design. This allows consistency
and conservation checks to be performed. A simple
example of a consistency check arises when measuring,
say, the pressure differences between indoors and outdoors of
a two-story residence. One could measure the pressure difference
between the first floor and the outdoors and between the second
floor and the outdoors, and deduce the difference in pressure
between the two floors as the difference between the two measurements.
Redundant consistency checking would involve also
measuring the pressure difference between the first and second floors
and verifying whether the three measurements are consistent
or not. Of course, such checks would increase the cost of the
instrumentation, and their need would depend on the specific
circumstance.
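
The redundant pressure measurement described above amounts to a closure check on three readings. A hedged sketch, with a hypothetical function name, made-up readings in Pa, and a tolerance chosen by the analyst:

```python
def pressures_consistent(dp_first_out, dp_second_out, dp_first_second, tolerance):
    """Check that (first floor - outdoors) minus (second floor - outdoors)
    agrees with the directly measured (first floor - second floor) difference."""
    closure = dp_first_out - dp_second_out - dp_first_second
    return abs(closure) <= tolerance

# Hypothetical readings in Pa
print(pressures_consistent(4.0, 1.5, 2.3, 0.5))  # closure error of 0.2 Pa
print(pressures_consistent(4.0, 1.5, 0.5, 0.5))  # closure error of 2.0 Pa
```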



3.3.3 Outlier Rejection by Visual Means 

This phase is undertaken after limit checks and independent
checks have been completed. Unless there is a definite reason
for suspecting that a particular observation is invalid,
indiscriminate outlier rejection is not advised. The sensible
approach is to select a reasonable rejection criterion, which
may depend on the specific circumstance, and couple this
with a visual inspection and computational diagnostics of
the data. A commonly used rejection criterion in case the distribution
is normal is to eliminate data points which are outside
the ±3 standard deviation range (see Fig. 3.8). Some
analysts advocate doing the analytical screening first; rather,
it is suggested here that the graphical screening be done first
since it also reveals the underlying distribution of the data.

Fig. 3.11 Scatter plot of the hourly chilled water consumption in a commercial building. Some of the obvious outlier points are circled. (From Abbas and Haberl 1994 by permission of Haberl)

When dealing with correlated bivariate data, relational 
scatter plots (such as x-y scatter plots) are especially use- 
ful since they also allow outliers to be detected with rela- 
tive ease by visual scrutiny. The hourly chilled water energy 
use in a commercial building is plotted against outside dry- 
bulb temperature in Fig. 3.11. One can clearly detect several 
of the points which fall away from the cloud of data points 
and which could be weeded out. Further, in cases when the 
physical process is such that its behavior is known at a limit 
(for example, both variables should be zero together), one 
could visually extrapolate the curve and determine whether 
this is more or less true. Outlier rejection based on statistical 
considerations is treated in Sect. 3.6.6. 
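
For completeness, the ±3 standard deviation criterion mentioned above can be sketched as below. This is an illustration only, assuming approximately normal data. It is meant to complement, not replace, the graphical screening recommended in the text. The function name and the data are hypothetical.

```python
import statistics

def flag_outliers(values, k=3.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * s]

# Hypothetical readings: twenty values near 10 and one suspicious point
data = [9.8, 10.1, 10.0, 9.9, 10.2] * 4 + [50.0]
print(flag_outliers(data))
```

Note that a gross outlier inflates the standard deviation itself, so in very small samples this criterion may fail to flag anything.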



3.3.4 Handling Missing Data 

Data is said to be missing, as against bad data during outlier
detection, when the channel goes "dead", indicating either
a zero value or a very small value which is constant over
time when the physics of the process would strongly indicate
otherwise. Missing data are bound to occur in most monitoring
systems, and can arise from a variety of reasons. First,
one should spend some time trying to ascertain the extent
of the missing data and whether it occurs preferentially, i.e.,
whether it is non-random. For example, certain sensors (such
as humidity sensors, flow meters or pollutant concentration sensors)
develop faults more frequently than others, and the data set
becomes biased. This non-random nature of missing data is
more problematic than the case of data missing at random.

There are several approaches to handling missing data. It 
is urged that the data be examined first before proceeding to 
rehabilitate it. These approaches are briefly described below: 

(a) Use observations with complete data only: This is the 
simplest and most obvious approach, and is adopted in most analyses.
Many software programs allow such cases
to be handled. Instead of coding missing values as zero, 
analysts often use a default value such as -99 to indi- 
cate a missing value. This approach is best suited when 
the missing data fraction is small enough not to cause 
the analysis to become biased. 

(b) Reject variables: In case only one or a few channels indi- 
cate high levels of missing data, the judicious approach 
is to drop these variables from the analysis itself. If 
these variables are known to be very influential, then 
more data needs to be collected with the measurement 
system modified to avoid such future occurrences. 

(c) Adopt an imputation method: This approach, also called 
data rehabilitation, involves estimating the missing val- 
ues based on one of the following methods: 

(i) substituting the missing values by a constant value 
is easy to implement but suffers from the drawback 
that it would introduce biases, i.e., it may distort 
the probability distribution of the variable, its vari- 
ance and its correlation with other variables; 

(ii) substituting the missing values by the mean of the 
missing variable deduced from the valid data. It suf- 
fers from the same distortion as (i) above, but would 
perhaps add a little more realism to the analysis; 

(iii) univariate interpolation where missing data from
a specific variable are predicted using time series
methods. One can use numerical methods used to
interpolate between tabular data as is common in
many engineering applications (see any appropriate
textbook on numerical methods; for example,
Ayyub and McCuen 1996). One method is that of
undetermined coefficients where an nth order polynomial
(usually second or third order suffices) is
used as the interpolation function whose n+1
coefficients are obtained by solving n+1 simultaneous
equations. The Gregory-Newton method results in
identifying a similar polynomial function without
requiring a set of simultaneous equations to be
solved. Another common interpolation method is
the Lagrange polynomials method (applicable to
data taken at unequal intervals). One could also use
trigonometric functions with time as the regressor
variable provided the data exhibits such periodic
variations (say, the diurnal variation of outdoor
dry-bulb temperature). This approach works well
when the missing data gaps are short and the process
is reasonably stable;

(iv) regression methods which use a regression model
between the variable whose data is missing and
other variables with complete data. Such regression
models can be simple regression models or could
be multivariate models depending on the specific
circumstance. Many of the regression methods
(including splines which are accurate especially
for cases where data exhibits large sudden changes
and which are described in Sect. 5.7.2) can be
applied. However, the analyst should be cognizant
of the fact that such a method of rehabilitation
always poses the danger of introducing, sometimes
subtle, biases in the final analysis results. The process
of rehabilitation may have unintentionally
given a structure or an interdependence which may
not have existed in the phenomena or process.

Table 3.1 Saturation water pressure with temperature
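
Two of the simpler imputation options, (ii) mean substitution and (iii) interpolation, are contrasted in the sketch below. It assumes equally spaced readings with missing entries coded as None. The function names and the temperature series are hypothetical.

```python
import statistics

def impute_mean(series):
    """Replace missing entries (None) by the mean of the valid data."""
    valid = [v for v in series if v is not None]
    m = statistics.mean(valid)
    return [m if v is None else v for v in series]

def impute_linear(series):
    """Fill interior gaps by straight-line interpolation between the nearest
    valid neighbors. Leading or trailing gaps are left as None."""
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            start = i - 1
            j = i
            while j < n and filled[j] is None:
                j += 1
            if start >= 0 and j < n:
                step = (filled[j] - filled[start]) / (j - start)
                for k in range(i, j):
                    filled[k] = filled[start] + step * (k - start)
            i = j
        else:
            i += 1
    return filled

# Hypothetical hourly temperatures with a two-hour gap
temps = [20.0, 22.0, None, None, 28.0, 26.0]
print(impute_mean(temps))    # gap filled with the mean of the valid data
print(impute_linear(temps))  # gap filled along the local trend
```

Mean substitution pulls the filled values toward the center of the data and so understates the variance, which is the distortion noted in items (i) and (ii).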

Example 3.3.1 Example of interpolation.
Consider Table 3.1 showing the saturated water vapor pressure
(P) against temperature (T). Let us assume that the
mid-point (T=58°C) is missing (see Fig. 3.12). The use of
different interpolation methods to determine this point using
two data points on either side of the missing point is illustrated
below.

Fig. 3.12 Simple linear interpolation to estimate value of missing point

(a) Simple linear interpolation: Since the x-axis data are at
equal intervals, one would estimate

\[ P(58^{\circ}\mathrm{C}) = (15.002 + 21.84)/2 = 18.421 \]

which is 1.5% too high.

(b) Method of undetermined coefficients using third order
model: In this case, a more flexible functional form of the
type \( P = a + b\,T + c\,T^2 + d\,T^3 \) is assumed, and using data
from the four sets of points, the following four simultaneous
equations need to be solved for the four coefficients:

\[
\begin{aligned}
12.335 &= a + b\,(50) + c\,(50)^2 + d\,(50)^3 \\
15.002 &= a + b\,(54) + c\,(54)^2 + d\,(54)^3 \\
21.840 &= a + b\,(62) + c\,(62)^2 + d\,(62)^3 \\
26.150 &= a + b\,(66) + c\,(66)^2 + d\,(66)^3
\end{aligned}
\]

Once the polynomial function is known, it can be used
to predict the value of P at T=58°C.

(c) Gregory-Newton method takes the form:

\[ y = a_1 + a_2 (x - x_1) + a_3 (x - x_1)(x - x_2) + \cdots \]

Substituting each set of data point in turn results in

\[ a_1 = y_1, \qquad a_2 = \frac{y_2 - a_1}{x_2 - x_1}, \qquad a_3 = \frac{(y_3 - a_1) - a_2 (x_3 - x_1)}{(x_3 - x_1)(x_3 - x_2)} \]

and so on.

It is left to the reader to use these formulae and estimate
the value of P at T=58°C. ■
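
One possible way to carry out the Gregory-Newton calculation numerically is sketched below. It follows the substitution scheme of item (c), determining each coefficient from the polynomial built so far, and uses the four data points quoted in item (b). The function names are hypothetical.

```python
def newton_eval(coeffs, xs, x):
    """Evaluate y = a1 + a2 (x - x1) + a3 (x - x1)(x - x2) + ... at x."""
    total = 0.0
    basis = 1.0
    for k, a in enumerate(coeffs):
        total += a * basis
        basis *= (x - xs[k])
    return total

def newton_coefficients(xs, ys):
    """Find a1, a2, ... by substituting each data point in turn."""
    coeffs = []
    for k, (xk, yk) in enumerate(zip(xs, ys)):
        partial = newton_eval(coeffs, xs, xk)   # polynomial found so far, at xk
        prod = 1.0
        for j in range(k):
            prod *= (xk - xs[j])
        coeffs.append((yk - partial) / prod)
    return coeffs

T = [50.0, 54.0, 62.0, 66.0]
P = [12.335, 15.002, 21.840, 26.150]

p_linear = (P[1] + P[2]) / 2.0
p_cubic = newton_eval(newton_coefficients(T, P), T, 58.0)
print(f"linear: {p_linear:.3f}, cubic: {p_cubic:.3f}")
```

Because the cubic through four points is unique, solving the four simultaneous equations of item (b) gives the same estimate, about 18.147, which is consistent with the statement that the linear estimate of 18.421 is 1.5% too high.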



3.4 Descriptive Measures for Sample Data 

3.4.1 Summary Statistical Measures 

Descriptive summary measures of sample data are meant to 
characterize salient statistical features of the data for easier 
reporting, understanding, comparison and evaluation. The 
following are some of the important ones: 
(a) Mean (or arithmetic mean or average) of a set or sample
of n numbers is:

\[ x_{\mathrm{mean}} \equiv \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{3.1} \]

where n = sample size, and \(x_i\) = individual reading.

(b) Weighted mean of a set of n numbers is:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} w_i} \tag{3.2} \]

where \(w_i\) is the weight for group i.

(c) Geometric mean is more appropriate when studying
phenomena that exhibit exponential behavior (like
population growth, biological processes, ...). This is
defined as the nth root of the product of n data points:

\[ x_{\mathrm{geometric}} = \left[ x_1 \cdot x_2 \cdots x_n \right]^{1/n} \tag{3.3} \]

(d) Mode is the value of the variate which occurs most fre- 
quently. When the variate is discrete, the mean may turn 
out to have a value that cannot actually be taken by the 
variate. In case of continuous variates, the mode is the 
value where the frequency density is highest. For exam- 
ple, a survey of the number of occupants in a car during 
the rush hour could yield a mean value of 1.6 which is 
not physically possible. In such cases, using a value of 
2 (i.e., the mode) is more appropriate. 

(e) Median is the middle value of the variates, i.e., half the 
numbers have numerical values below the median and 
half above. The mean is unduly influenced by extreme 
observations, and in such cases the median is a more 
robust indicator of the central tendency of the data. In 
case of an even number of observations, the mean of the 
middle two numbers is taken to be the median. 

(f) Range is the difference between the largest and the 
smallest observation values. 

(g) Percentiles are used to separate the data into bins. Let
p be a number between 0 and 1. Then, the (100p)th percentile
(also called pth quantile) represents the data
value where 100p% of the data values are lower. Thus,
90% of the data will be below the 90th percentile, and
the median is represented by the 50th percentile.

(h) Inter-quartile range (IQR) cuts out the more extreme 
values in a distribution. It is the range which covers the 
middle 50% of the observations and is the difference 
between the lower quartile (i.e., the 25th percentile) and 
the upper quartile (i.e., the 75th percentile). In a similar 
manner, deciles divide the distribution into tenths, and
percentiles into hundredths.

(i) Deviation of a number \(x_i\) in a set of n numbers is a
measure of dispersion of the data from the mean, and is
given by:

\[ d_i = (x_i - \bar{x}) \tag{3.4} \]

(j) The mean deviation of a set of n numbers is the mean
of the absolute deviations:

\[ \bar{d} = \frac{1}{n} \sum_{i=1}^{n} |d_i| \tag{3.5} \]

(k) The variance or the mean square error (MSE) of a set
of n numbers is:

\[ s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{s_{xx}}{n-1} \tag{3.6a} \]

where

\[ s_{xx} = \text{sum of squares} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{3.6b} \]

(l) The standard deviation of a set of n numbers is:

\[ s_x = \sqrt{\frac{s_{xx}}{n-1}} \tag{3.7} \]

The more variation there is in the data set, the bigger
the standard deviation. This is a measure of the actual
absolute error. For large samples (say, n > 100), one can
replace (n - 1) by n in the above equation with acceptable
error.

(m) Coefficient of variation is a measure of the relative
error, and is often more appropriate than the standard
deviation. It is defined as the ratio of the standard deviation
and the mean:

\[ CV = s_x / \bar{x} \tag{3.8} \]

This measure is also used in other disciplines: the reciprocal
of the "signal to noise ratio" is widely used in
electrical engineering, and also as a measure of "risk"
in financial decision making.
(n) Trimmed mean. The sample mean may be very sensi- 
tive to outliers, and, hence, may bias the analysis results. 
The sample median is more robust since it is impervi- 
ous to outliers. However, non-parametric tests which 
use the median are less efficient than parametric tests 
in general. Hence, a compromise is to use the trimmed 
mean value which is less sensitive to outliers than the 
mean but is more sensitive than the median. One selects 
a trimming percentage 100r% with the recommendation 
that 0<r<0.25. Suppose one has a data set with n = 20. 
Selecting r=0.1 implies that the trimming percentage is 
10% (i.e., two observations). Then, two of the largest 
values and two of the smallest values of the data set are 
rejected prior to subsequent analysis. Thus, a specified 
percentage of the extreme values can be removed. 
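
Several of the measures defined above are available in the Python standard library (the statistics module, with geometric_mean requiring Python 3.8 or later). The sketch below applies them to a made-up sample of n = 20 containing one extreme value, and implements the trimmed mean as described in item (n). The helper names trimmed_mean and coefficient_of_variation are hypothetical.

```python
import statistics

def trimmed_mean(values, r):
    """Mean after rejecting the fraction r of observations at each end."""
    ordered = sorted(values)
    k = int(round(r * len(ordered)))       # number rejected at each end
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return statistics.mean(kept)

def coefficient_of_variation(values):
    """Ratio of the sample standard deviation to the mean, Eq. (3.8)."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical sample: the integers 1 to 19 plus one extreme value
data = list(range(1, 20)) + [200]

print("mean          ", statistics.mean(data))
print("median        ", statistics.median(data))
print("geometric mean", statistics.geometric_mean(data))
print("std deviation ", statistics.stdev(data))
print("CV            ", coefficient_of_variation(data))
print("10% trimmed   ", trimmed_mean(data, 0.1))
```

With r = 0.1 and n = 20, the two largest and two smallest values are rejected, so the extreme value of 200 no longer affects the result. Here the trimmed mean coincides with the median, whereas the ordinary mean is pulled well above both.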

Example 3.4.1 Exploratory data analysis of utility bill data 
The annual degree-day number (DD) is a statistic specific to 
the climate of the city or location which captures the annual 
variation of the ambient dry-bulb temperature usually above 
a pre-specified value such as 65°F or 18.3°C (see Pr. 2.24 for 
description). Gas and electric utilities have been using the DD 
method to obtain a first order estimate of the gas and electric 
use of residences in their service territory. The annual heating
consumption Q of a residence can be predicted as:

Q = U × A × DD

where U is the overall heat loss coefficient of the residence
(includes heat conduction as well as air infiltration, ...) and A
is the house floor area.

Based on gas bills, a certain electric company calculated
the U value of 90 homes in their service territory in an effort
to determine which homes were "leaky", and hence are good
candidates for weather stripping so as to reduce their energy
use. These values (in units which need not concern us here)
are given in Table 3.2.

Table 3.2 Values of the heat loss coefficient for 90 homes (Example 3.4.1)

 2.97   4.00   5.20   5.56   5.94   5.98   6.35   6.62   6.72   6.78
 6.80   6.85   6.94   7.15   7.16   7.23   7.29   7.62   7.62   7.69
 7.73   7.87   7.93   8.00   8.26   8.29   8.37   8.47   8.54   8.58
 8.61   8.67   8.69   8.81   9.07   9.27   9.37   9.43   9.52   9.58
 9.60   9.76   9.82   9.83   9.83   9.84   9.96  10.04  10.21  10.28
10.28  10.30  10.35  10.36  10.40  10.49  10.50  10.64  10.95  11.09
11.12  11.21  11.29  11.43  11.62  11.70  11.70  12.16  12.19  12.28
12.31  12.62  12.69  12.71  12.91  12.92  13.11  13.38  13.42  13.43
13.47  13.60  13.96  14.24  14.35  15.12  15.24  16.06  16.90  18.26

An exploratory data analysis would involve generating the 
types of pertinent summary statistics or descriptive measures 
given in Table 3.3. Note that no value is given for "Mode" 
since there are several possible values in the table. What can 
one say about the variability in the data? If all homes whose 
U values are greater than twice the mean value are targeted 
for further action, how many such homes are there? Such 
questions and answers are left to the reader to explore. ■ 
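
As a rough illustration, several of the descriptive measures of this example can be reproduced with the Python standard library. This is only a sketch: the variable and function names are illustrative, the sample (n - 1) form of the variance is assumed, and quartiles are left out because different software packages use slightly different quartile conventions.

```python
import statistics

# Heat loss coefficients U for the 90 homes, transcribed from Table 3.2.
U_VALUES = [
    2.97, 4.00, 5.20, 5.56, 5.94, 5.98, 6.35, 6.62, 6.72, 6.78,
    6.80, 6.85, 6.94, 7.15, 7.16, 7.23, 7.29, 7.62, 7.62, 7.69,
    7.73, 7.87, 7.93, 8.00, 8.26, 8.29, 8.37, 8.47, 8.54, 8.58,
    8.61, 8.67, 8.69, 8.81, 9.07, 9.27, 9.37, 9.43, 9.52, 9.58,
    9.60, 9.76, 9.82, 9.83, 9.83, 9.84, 9.96, 10.04, 10.21, 10.28,
    10.28, 10.30, 10.35, 10.36, 10.40, 10.49, 10.50, 10.64, 10.95, 11.09,
    11.12, 11.21, 11.29, 11.43, 11.62, 11.70, 11.70, 12.16, 12.19, 12.28,
    12.31, 12.62, 12.69, 12.71, 12.91, 12.92, 13.11, 13.38, 13.42, 13.43,
    13.47, 13.60, 13.96, 14.24, 14.35, 15.12, 15.24, 16.06, 16.90, 18.26,
]


def summarize(values):
    """Return a dictionary of basic descriptive statistics."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return {
        "count": len(values),
        "mean": mean,
        "median": statistics.median(values),
        "geometric_mean": statistics.geometric_mean(values),
        "variance": statistics.variance(values),
        "std_dev": std,
        "coeff_of_variation_pct": 100.0 * std / mean,
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
    }


summary = summarize(U_VALUES)
for name, value in summary.items():
    print(f"{name:>24s}: {value:.4f}")
```

The printed values can be compared against the entries of Table 3.3.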



3.4.2 Covariance and Pearson Correlation Coefficient

Though a scatter plot of bivariate numerical data gives a 
good visual indication of how strongly variables x and y vary 
together, a quantitative measure is needed. This is provided 
by the covariance which represents the strength of the linear 
relationship between the two variables: 



$$\mathrm{cov}(xy) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}) \cdot (y_i - \bar{y}) \tag{3.9}$$

where $\bar{x}$ and $\bar{y}$ are the mean values of variables x and y.

To remove the effect of magnitude in the variation of x 
and y, the Pearson correlation coefficient r is probably more 
meaningful than the covariance since it standardizes the 
coefficients x and y by their standard deviations: 



$$r = \frac{\mathrm{cov}(xy)}{s_x s_y} \tag{3.10}$$

where $s_x$ and $s_y$ are the standard deviations of x and y.

Table 3.3 Summary statistics for values of the heat loss coefficient
(Example 3.4.1)

Count                   90
Average                 10.0384
Median                  9.835
Mode
Geometric mean          9.60826
5% Trimmed mean         9.98444
Variance                8.22537
Standard deviation      2.86799
Coeff. of variation     28.5701%
Minimum                 2.97
Maximum                 18.26
Range                   15.29
Lower quartile          7.93
Upper quartile          12.16
Interquartile range     4.23

Hence the absolute value of r is less than or equal to
unity. |r| = 1 implies that all the points lie on a straight line,
while r = 0 implies no linear correlation between x and y. It is
pertinent to point out that for linear models r² = R² (the well
known coefficient of determination used in regression and
discussed in Sect. 5.3.2), the use of lower case and upper
case to denote the same quantity being a historic dichotomy.
Figure 3.13 illustrates how different data scatter affects
the magnitude and sign of r. Note that a few extreme points
may exert undue influence on r, especially when data sets are
small. As a general rule of thumb¹, for applications involving
engineering data where the random uncertainties are low:

$$\begin{aligned}
|r| > 0.9 &\rightarrow \text{strong linear correlation} \\
0.7 < |r| < 0.9 &\rightarrow \text{moderate} \\
|r| < 0.7 &\rightarrow \text{weak}
\end{aligned} \tag{3.11}$$



¹ A more statistically sound procedure is described in Sect. 4.2.7 which
allows one to ascertain whether observed correlation coefficients are
significant or not.

Fig. 3.13 Illustration of various plots with different correlation
strengths. (From Wonnacott and Wonnacott (1985) by permission of
John Wiley and Sons)

It is very important to note that inferring non-association of
two variables x and y from inspection of their correlation
coefficient is misleading since it only indicates linear
relationship. Hence, a poor correlation does not mean that no
relationship exists between them (for example, a second order
relation may exist between x and y; see Fig. 3.13f). Note
also that correlation analysis does not indicate whether the
relationship is causal, i.e. one cannot assume causality just
because a correlation exists. Finally, keep in mind that the
correlation analysis does not provide an equation for predicting
the value of a variable; this is done under model building
(see Chap. 5).
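
The covariance and Pearson correlation coefficient of Eqs. 3.9 and 3.10 can be sketched in a few lines of Python. The sketch below assumes the sample (n - 1) form for both the covariance and the standard deviations, and uses the load-extension observations of Table 3.4 (Example 3.4.2) as input; the function names are illustrative only.

```python
import math


def covariance(x, y):
    """Sample covariance of two equal-length sequences (n - 1 divisor)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


def sample_std(x):
    """Sample standard deviation, i.e. the square root of cov(x, x)."""
    return math.sqrt(covariance(x, x))


def pearson_r(x, y):
    """Pearson correlation coefficient: covariance scaled by both std devs."""
    return covariance(x, y) / (sample_std(x) * sample_std(y))


# Spring data of Table 3.4: load in Newtons, extension in mm.
load = [2, 4, 6, 8, 10, 12]
extension = [10.4, 19.6, 29.9, 42.2, 49.2, 58.5]

r = pearson_r(load, extension)
print(f"s_load = {sample_std(load):.4f}")
print(f"s_extension = {sample_std(extension):.4f}")
print(f"r = {r:.4f}")
```

Because the same divisor appears in the numerator and the denominator, r itself does not depend on whether n or n - 1 is used.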



Example 3.4.2 The following observations are taken of the
extension of a spring under different loads (Table 3.4).

Table 3.4 Extension of a spring with applied load

Load (Newtons)     2      4      6      8      10     12
Extension (mm)     10.4   19.6   29.9   42.2   49.2   58.5

The standard deviations of load and extension are 3.7417
and 18.2978 respectively, while the correlation coefficient
= 0.9979. This indicates a very strong positive correlation
between the two variables as one should expect. ■



3.4.3 Data Transformations

Once the above validation checks have been completed, the
raw data can be transformed into a form on which subsequent
engineering analyses can be performed. Examples include
converting into appropriate units, taking ratios, transforming
variables, ... Sometimes, normalization methods may be
required, which are described below:

(a) Decimal scaling moves the decimal point but still
    preserves most of the original data. The specific
    observations of a given variable may be divided by $10^x$
    where x is the smallest integer such that all the
    observations are scaled between -1 and 1. For example, say
    the largest value is 289 and the smallest value is -150;
    then since x = 3, all observations are divided by 1000 so
    as to lie between -0.150 and 0.289.

(b) Min-max scaling allows for better distribution of
    observations over the range of variation than does decimal
    scaling. It does this by redistributing the values to
    lie between 0 and 1. Hence, each observation is
    normalized as follows:

    $$z_i = \frac{x_i - x_{min}}{x_{max} - x_{min}} \tag{3.12}$$

    where $x_{max}$ and $x_{min}$ are the maximum and minimum
    numerical values respectively of the x variable. Note
    that though this transformation may look very appealing,
    the scaling relies largely on the minimum and maximum
    values, which are generally not very robust and
    may be error prone.

(c) Standard deviation scaling is widely used for distance
    measures (such as in multivariate statistical analysis)
    but transforms data into a form unrecognizable from
    the original data. Here, each observation is normalized
    as follows:

    $$z_i = \frac{x_i - \bar{x}}{s_x} \tag{3.13}$$

    where $\bar{x}$ and $s_x$ are the mean and standard deviation
    respectively of the x variable.
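
The three normalization methods can be sketched as follows. This is a minimal illustration under stated assumptions: min-max scaling is taken in its usual form z = (x - x_min)/(x_max - x_min), which maps the data onto [0, 1]; standard deviation scaling is taken as z = (x - mean)/s with the sample standard deviation; and the sample list is hypothetical apart from the two extreme values 289 and -150 quoted in the text.

```python
import statistics


def decimal_scale(values):
    """Divide by 10**x, with x the smallest non-negative integer that
    brings every absolute value below 1."""
    max_abs = max(abs(v) for v in values)
    x = 0
    while max_abs / 10 ** x >= 1:
        x += 1
    return [v / 10 ** x for v in values]


def min_max_scale(values):
    """Map the observations linearly onto [0, 1].
    Assumes the values are not all identical (non-zero range)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def std_scale(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - mean) / s for v in values]


# Hypothetical observations; only the extremes come from the text.
sample = [289, 120, 0, -75, -150]

print(decimal_scale(sample))
print(min_max_scale(sample))
print(std_scale(sample))
```

Note how a single erroneous extreme value would change every min-max scaled observation, which is the robustness concern raised under (b).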



3.5 Plotting Data 

Graphs serve two purposes. During exploration of the data,
they provide a better means of assimilating broad qualitative
trend behavior of the data than can be provided by tabular
data. Second, they provide an excellent manner of communicating
to the reader what the author wishes to state or illustrate
(recall the adage "a picture is worth a thousand words").
Hence, they can serve as mediums to communicate information,
not just to explore data trends (an excellent reference
is Tufte 2001). However, it is important to be clear as to the
intended message or purpose of the graph, and also to tailor
it to the intended audience's background and understanding.
A pretty graph may be visually appealing but may obfuscate
rather than clarify or highlight the necessary aspects being
communicated. For example, unless one is experienced, it is
difficult to read numerical values off 3-D graphs. Thus,
graphs should present data clearly and accurately without
hiding or distorting the underlying intent. Table 3.5 provides
a succinct summary of graph formats appropriate for different
applications.

Graphical methods are recommended after the numerical
screening phase is complete since they can point out
unflagged data errors. Historically, the strength of a graphical
analysis was to visually point out to the analyst relationships
(linear or non-linear) between two or more variables in
instances when a sound physical understanding is lacking,
thereby aiding in the selection of the appropriate regression
model. Present day graphical visualization tools allow much
more than this simple objective, some of which will become
apparent below. There are a very large number of graphical
ways of presenting data, and it is impossible to cover
them all. Only a small set of representative and commonly used
plots will be presented below, while operating manuals of
several high-end graphical software programs describe complex,
and sometimes esoteric, plots which can be generated
by their software.



Table 3.5 Type and function of graph message determines format.
(Downloaded from http://www.eia.doe.gov/neic/graphs/introduc.htm)

Type of message     Function                          Typical format

Component           Shows relative size of            - Pie chart (for 1 or 2
                    various parts of a whole            important components)
                                                      - Bar chart
                                                      - Dot chart
                                                      - Line chart

Relative amounts    Ranks items according to          - Bar chart
                    size, impact, degree, etc.        - Line chart
                                                      - Dot chart

Time series         Shows variation over time         - Bar chart (for few
                                                        intervals)
                                                      - Line chart

Frequency           Shows frequency of                - Histogram
                    distribution among certain        - Line chart
                    intervals                         - Box-and-whisker

Correlation         Shows how changes in              - Paired bar
                    one set of data is related        - Line chart
                    to another set of data            - Scatter diagram


3.5.1 Static Graphical Plots 

Graphical representations of data are the backbone of exploratory
data analysis. They are usually limited to one-, two- and
three-dimensional data. In the last few decades, there has been
a dramatic increase in the types of graphical displays largely
due to the seminal contributions of Tukey (1988), Cleveland
(1985) and Tufte (1990, 2001). A particular graph is selected
based on its ability to emphasize certain characteristics or
behavior of one-dimensional data, or to indicate relations
between two- and three-dimensional data. A simple manner of
separating these characteristics is to view them as being:
(i) cross-sectional (i.e., the sequence in which the data has
    been collected is not retained),
(ii) time series data,
(iii) hybrid or combined, and
(iv) relational (i.e., emphasizing the joint variation of two or
    more variables).

An emphasis on visualizing data to be analyzed has
resulted in statistical software programs becoming increasingly
convenient to use and powerful towards this end. Any
data analysis effort involving univariate and bivariate data
should start by looking at basic plots (higher dimension data
require more elaborate plots discussed later).

(a) for univariate data:

Commonly used graphics for cross-sectional representation
are mean and standard deviation plots, stem-and-leaf plots,
histograms, box-whisker-mean plots, distribution plots, bar
charts, pie charts, area charts and quantile plots. Mean and
standard deviation plots summarize the data distribution using
the two most basic measures; however, this manner is of limited
use (and even misleading) when the distribution is skewed. For
univariate data, plotting of histograms is very useful since they
provide insight into the underlying parent distribution of data
dispersion, and can flag outliers as well. There are no hard
and fast rules of how to select the number of bins ($N_{bins}$) or
classes in case of continuous data, probably because there
is no theoretical basis. Generally, the larger the number of
observations n, the more classes can be used, though as a
guide it should be between 5 and 20. Devore and Farnum
(2005) suggest:

$$\text{Number of bins or classes} = N_{bins} = (n)^{1/2} \tag{3.14}$$

which would suggest that if n = 100, $N_{bins}$ = 10.

Doebelin (1995) proposes another equation:

$$N_{bins} = 1.87 \cdot (n-1)^{0.4} \tag{3.15}$$

which would suggest that if n = 100, $N_{bins}$ = 12.
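
The two bin-count guidelines are easy to compare numerically. The sketch below assumes that the result is rounded to the nearest integer, which the text does not state explicitly; the function names are illustrative.

```python
import math


def bins_sqrt(n):
    """Square-root rule (Eq. 3.14), rounded to the nearest integer."""
    return round(math.sqrt(n))


def bins_doebelin(n):
    """Doebelin's rule (Eq. 3.15), rounded to the nearest integer."""
    return round(1.87 * (n - 1) ** 0.4)


for n in (20, 90, 100, 1000):
    print(f"n = {n:5d}: sqrt rule -> {bins_sqrt(n):3d}, "
          f"Doebelin -> {bins_doebelin(n):3d}")
```

For large n the square-root rule exceeds the suggested upper guide of 20 bins, so the guidelines are best treated as starting points rather than fixed prescriptions.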

The box and whisker plots also summarize the distribution,
but at different percentiles (see Fig. 3.14). The lower
and upper box values (or hinges) correspond to the 25th and
75th percentiles (i.e., the interquartile range (IQR) defined
in Sect. 3.4.1) while the whiskers extend to 1.5 times the
IQR on either side. These allow outliers to be detected. Any
observation farther than (3.0 × IQR) from the closest quartile
is taken to be an extreme outlier, while if farther than
(1.5 × IQR), it is considered to be a mild outlier.

Fig. 3.14 Box and whisker plot and its association with a normal
distribution. The box represents the 50th percentile range while
the whiskers extend 1.5 times the inter-quartile range (IQR)
on either side. (From Wikipedia website)
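
The mild/extreme outlier rule can be expressed as a small function. In this sketch the quartiles are supplied by the caller (here the values 7.93 and 12.16 reported in Table 3.3 are used) so that the result does not depend on any particular quartile convention; the function name and the test values 19.0 and 25.0 are hypothetical.

```python
def classify_outlier(value, q1, q3):
    """Classify an observation by its distance from the closest quartile:
    more than 3.0 x IQR away -> 'extreme', more than 1.5 x IQR -> 'mild'."""
    iqr = q3 - q1
    if value < q1:
        distance = q1 - value
    elif value > q3:
        distance = value - q3
    else:
        return "none"  # inside the box
    if distance > 3.0 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:
        return "mild"
    return "none"


# Quartiles of the heat loss coefficient data as reported in Table 3.3.
Q1, Q3 = 7.93, 12.16

for u in (2.97, 18.26, 19.0, 25.0):
    print(f"U = {u:5.2f}: {classify_outlier(u, Q1, Q3)}")
```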

Though plotting a box-and-whisker plot or a plot of the 
distribution itself can suggest the shape of the underlying 
distribution, a better visual manner of ascertaining whether 
a presumed underlying parent distribution applies to the data 
being analyzed is to plot a quantile plot (also called the prob- 
ability plot). The observations are plotted against the parent 
distribution (which could be any of the standard probability 
distributions presented in Sect. 2.4), and if the points fall on 
a straight line, this suggests that the assumed distribution is 
plausible. The example below illustrates this concept. 

Example 3.5.1 An instructor wishes to ascertain whether 
the time taken by his students to complete the final exam fol- 
lows a normal or Gaussian distribution. The values in min- 
utes shown in Table 3.6 have been recorded. 

The quantile plot for this data assuming the parent distribution
to be Gaussian is shown in Fig. 3.15. The pattern
is obviously nonlinear, so a Gaussian distribution is implausible
for this data. The apparent break appearing in the data
on the right side of the graph is indicative of data that contains
outliers (caused by five students taking much longer to
complete the exam). ■
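
The coordinates of such a normal quantile plot can be generated as sketched below, using the exam times of Table 3.6. The plotting position p = (i - 0.5)/n is an assumption made here for illustration; other conventions exist and the one used to produce Fig. 3.15 is not stated.

```python
from statistics import NormalDist

# Exam completion times (minutes) for 20 students, from Table 3.6.
EXAM_TIMES = [37.0, 37.5, 38.1, 40.0, 40.2, 40.8, 41.0, 42.0, 43.1, 43.9,
              44.1, 44.6, 45.0, 46.1, 47.0, 62.0, 64.3, 68.8, 70.1, 74.5]


def normal_quantile_pairs(values):
    """Pair each ordered observation with a standard normal quantile.

    The i-th smallest observation (i = 1..n) is assigned the cumulative
    probability p = (i - 0.5) / n (an assumed plotting position)."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    pairs = []
    for i, obs in enumerate(ordered, start=1):
        p = (i - 0.5) / n
        pairs.append((std_normal.inv_cdf(p), obs))
    return pairs


pairs = normal_quantile_pairs(EXAM_TIMES)
for z, minutes in pairs:
    print(f"{z:+.3f}  {minutes:5.1f}")
```

Plotting the second column against the first gives the quantile plot; points that follow a straight line would support the Gaussian assumption, whereas the jump between 47.0 and 62.0 minutes shows up as a clear break.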

Table 3.6 Values of time taken (in minutes) for 20 students to
complete an exam

37.0   37.5   38.1   40.0   40.2   40.8   41.0   42.0   43.1   43.9
44.1   44.6   45.0   46.1   47.0   62.0   64.3   68.8   70.1   74.5

Fig. 3.15 Quantile plot of data in Table 3.6 assuming a Gaussian
(normal) distribution (x-axis: normal quantile)

Example 3.5.2 Consider the same data set as for Example
3.4.1. The following plots have been generated (Fig. 3.16):

(b) Box and whisker plot

(c) Histogram of data (assuming 9 bins)

(d) Normal probability plots

(e) Run chart

It is left to the reader to identify and briefly state their
observations regarding this data set. Note that the run chart
is meant to retain the time series nature of the data while the
other graphics do not. The manner in which the run chart has
been generated is meaningless since the data seems to have
been entered into the spreadsheet in the wrong sequence,
with data entered column-wise instead of row-wise. The
run chart, had the data been entered correctly, would have
resulted in a monotonically increasing curve and been more
meaningful. ■

(b) for bi-variate and multi-variate data

There are numerous graphical representations which fall in
this category and only an overview of the more common plots
will be provided here. Multivariate stationary data of worldwide
percentages of total primary energy sources can be
represented by the widely used pie chart (Fig. 3.17a) which
allows the relative aggregate amounts of the variables to be
clearly visualized. The same information can also be plotted
as a bar chart (Fig. 3.17b) which is not quite as revealing.

More elaborate bar charts (such as those shown in
Fig. 3.18) allow numerical values of more than one variable
to be plotted such that their absolute and relative amounts
are clearly highlighted. The plots depict differences between
the electricity sales during each of the four different quarters
of the year over 6 years. Such plots can be drawn as
compounded plots to allow better visual inter-comparisons
(Fig. 3.18a). Column charts or stacked charts (Fig. 3.18b, c)
show the same information as that in Fig. 3.18a but are
stacked one above another instead of showing the numerical
values side-by-side. One plot shows the stacked values
normalized such that the sum adds to 100%, while another
stacks them so as to retain their numerical values. Finally,
the same information can be plotted as an area chart wherein
both the time series trend and the relative magnitudes are
clearly highlighted.

Time series plots or relational plots or scatter plots (such
as x-y plots) between two variables are the most widely
used types of graphical displays. Scatter plots allow visual
determination of the trend line between two variables and
the extent to which the data scatter around the trend line
(Fig. 3.19).

Another important issue is that the manner of selecting
the range of the variables can be misleading to the eye. The
same data is plotted in Fig. 3.20 on two different scales, but
one would erroneously conclude that there is more data scatter
around the trend line for (b) than for (a). This is referred
to as the lie factor, defined as the ratio of the apparent size
of effect in the graph to the actual size of effect in the data
(Tufte 2001). The data at hand and the intent of the analysis
should dictate the scale of the two axes, but it is difficult
in practice to determine this heuristically². It is in such
instances that statistical measures can be used to provide an
indication of the magnitude of the graphical scales.

Fig. 3.16 Various exploratory plots for data in Table 3.2

Fig. 3.17 Two different ways of plotting stationary data. Data
corresponds to worldwide percentages of total primary energy supply in
2003. (From IEA, World Energy Outlook, IEA, Paris, France, 2004)

Dot plots are simply one dimensional plots where each
dot is an observation on a univariate scale. The 2-D version
of such plots is the well-known x-y scatter plot. An additional
variable representative of a magnitude can be included
by increasing the size of the dot to reflect this magnitude.
Figure 3.21 shows such a representation for the commute
patterns in major U.S. cities in 2008.

Combination charts can take numerous forms, but in 
essence, are those where two different basic ways of repre- 
senting data are combined together. One example is Fig. 3.22 
where the histogram depicts actual data spread, the distribu- 
tion of which can be visually evaluated against the standard 
normal curve. 

For purposes of data checking, x-y plots are perhaps most
appropriate as discussed in Sect. 3.3.3. The x-y scatter plot
(Fig. 3.11) of hourly cooling energy use of a large institutional
building versus outdoor temperature allowed outliers
to be detected. The same data could be summarized by combined
box and whisker plots (first suggested by Tukey 1988)
as shown in Fig. 3.23. Here the x-axis range is subdivided
into discrete bins (in this case, 5°F bins), showing the median
values (joined by a continuous line) along with the 25th and
75th percentiles on either side of the median (shown boxed) and
the 10th and 90th percentiles indicated by the vertical whiskers
from the box; the values less than the 10th percentile and
those greater than the 90th percentile are shown as individual
pluses (+).³ Such a representation is clearly a useful tool for
data quality checking, for detecting underlying patterns in
data at different sub-ranges of the independent variable, and
also for ascertaining the shape of the data spread around this
pattern.
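
The binning step behind such a combined box and whisker plot can be sketched as follows. The function name, the choice of the "inclusive" percentile method of the Python standard library, and the tiny data set are all assumptions made for illustration; they are not taken from the study behind Fig. 3.23.

```python
import math
import statistics
from collections import defaultdict


def binned_summary(x, y, bin_width=5.0):
    """Group y by fixed-width bins of x and return, for each bin, the
    10th, 25th, 50th, 75th and 90th percentiles of y."""
    bins = defaultdict(list)
    for xi, yi in zip(x, y):
        lower_edge = math.floor(xi / bin_width) * bin_width
        bins[lower_edge].append(yi)

    summary = {}
    for lower_edge in sorted(bins):
        values = bins[lower_edge]
        if len(values) < 2:
            continue  # percentiles are not meaningful for a single point
        deciles = statistics.quantiles(values, n=10, method="inclusive")
        quartiles = statistics.quantiles(values, n=4, method="inclusive")
        summary[lower_edge] = {
            "count": len(values),
            "p10": deciles[0],
            "p25": quartiles[0],
            "median": quartiles[1],
            "p75": quartiles[2],
            "p90": deciles[8],
        }
    return summary


# Hypothetical outdoor temperatures (°F) and hourly energy use values.
temperature = [60, 61, 62, 63, 64, 65, 66, 67]
energy_use = [1, 2, 3, 4, 5, 10, 20, 30]

summary = binned_summary(temperature, energy_use)
for lower_edge, stats in summary.items():
    print(f"{lower_edge:.0f}-{lower_edge + 5:.0f} °F: {stats}")
```

Observations below p10 or above p90 in each bin would then be drawn as the individual points of the plot.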

(c) for higher dimension data:

Some of the common plots are multiple trend lines, contour
plots, component matrix plots, and three-dimensional charts.
In case the functional relationship between the independent
and dependent variables changes due to known causes, it
is advisable to plot these in different frames. For example,
hourly energy use in a commercial building is known to
change with time of day but the functional relationship is
quite different depending on the season (time of year).
Component-effect plots are multiple plots between the variables
for cold, mild and hot periods of the year combined with box
and whisker type of presentation. They provide more clarity
in underlying trends and scatter as illustrated in Fig. 3.24
where the time of year is broken up into three temperature
bins.

Three dimensional (or 3-D) plots have been increasingly
used over the past few decades. They allow plotting the
variation of a variable when it is influenced by two independent
factors (Fig. 3.25). They also allow trends to be gauged and
are visually appealing, but the numerical values of the variables
are difficult to read.

Another benefit of such 3-D plots is their ability to aid in
the identification of oversights. For example, energy use data
collected from a large commercial building could be improperly
time-stamped, such as by overlooking the daylight savings
shift or by misalignment of 24-hour holiday profiles (Fig. 3.26).
One drawback associated with these graphs is the
difficulty in viewing exact details such as the specific hour
or specific day on which a misalignment occurs. Some analysts
complain that 3-D surface plots obscure data that is
behind "hills" or in "valleys". Clever use of color or dotted
lines has been suggested to make it easier to interpret such
graphs.



² Generally, it is wise, at least at the onset, to adopt scales starting from
zero, view the resulting graphs and make adjustments to the scales as
appropriate.

³ Note that the whisker end points are different than those described
earlier in Sect. 3.5.1. Different textbooks and papers adopt slightly
different selection criteria.

Fig. 3.18 Different types of bar plots to illustrate year by
year variation (over 6 years) in quarterly electricity sales (in
GigaWatt-hours) for a certain city

Fig. 3.19 Scatter plot (or x-y plot) with trend line through the
observations. In this case, a second order quadratic regression model
has been selected as the trend line

Fig. 3.20 Figure to illustrate how the effect of resolution can mislead
visually. The same data is plotted in the two plots but one would
erroneously conclude that there is more data scatter around the trend
line for (b) than for (a).



Often values of physical variables need to be plotted
against two physical variables. One example is the well-known
psychrometric chart which allows one to determine (for a
given elevation) the various properties of air-water mixtures
(such as relative humidity, specific volume, enthalpy,
wet bulb temperature) when the mixture is specified by its
dry-bulb temperature and the humidity ratio. In such cases, a
series of lines are drawn for each variable at selected numerical
values. A similar and useful representation is a contour
plot which is a plot of iso-lines of the dependent variable
at different preselected magnitudes drawn over the range of
variation of the two independent variables. An example is
provided by Fig. 3.27 where the total power of a condenser
loop of a cooling system is the sum of the pump power and
the cooling tower fan power.

Another visually appealing plot is the sun-path diagram 
which allows one to determine the position of the sun in 
the sky (defined by the solar altitude and the solar azimuth 
angles) at different times of the day and the year for a loca- 
tion of latitude 40° N (Fig. 3.28). Such a representation has 
also been used to determine periods of the year when shad- 
ing occurs from neighboring obstructions. Such consider- 
ations are important while siting solar systems or designing 
buildings. 

Carpet plots (or scatter plot matrices), such as Fig. 3.29, are
another useful way of visualizing multivariate data.
Here the various permutations of the variables are shown as
individual scatter plots. The idea, though not novel, has merit
because of the way the graphs are organized and presented.
The graphs are arranged in rows and columns such that each
row or column has all the graphs relating a certain variable
to all the others; thus, the variables have shared axes. Though
there are twice as many graphs as needed minimally (since
each graph has another one with the axes interchanged),
the redundancy is sometimes useful to the analyst in better
detecting underlying trends.

Fig. 3.21 Commute patterns in major U.S. cities in 2008 shown
as enhanced dot plots with the size of the dot representing the
number of commuters. (From Wikipedia website)

Fig. 3.22 Several combination charts are possible. The plot shown
allows visual comparison of the standardized (subtracted by the mean
and divided by the standard deviation) hourly whole-house electricity
use in a large number of residences against the standard normal
distribution. (From Reddy 1990)

Fig. 3.23 Scatter plot combined with box-whisker-mean (BWM) plot
of the same data as shown in Fig. 3.11. (From Haberl and Abbas (1998)
by permission of Haberl)



3.5.2 High-Interaction Graphical Methods

The above types of plots can be generated by relatively low
end data analysis software programs. More specialized software
programs called data visualization software are available
which provide much greater insights into data trends,
outliers and local behavior, especially when large amounts of
data are being considered. Animation has also been used to
advantage in understanding system behavior from monitored
data since time sequence can be retained due to, say, seasonal
differences. Animated scatter plots of the x and y variables
as well as animated contour plots, with color superimposed,
which can provide better visual diagnostics have also been
developed.

Fig. 3.24 Example of a combined box-whisker-component plot depicting
how hourly energy use varies with hour of day during a year for
different outdoor temperature bins for a large commercial building.
(From ASHRAE 2002 © American Society of Heating, Refrigerating and
Air-conditioning Engineers, Inc., www.ashrae.org)

Fig. 3.25 Three dimensional surface charts of mean hourly whole-house
electricity during different hours of the day across a large number
of residences. (From Reddy 1990)

More sophisticated software is available which, however, 
requires higher user skill. Glaser and Ubbelohde (2001) 
describe novel high performance visualization techniques 
for reviewing time dependent data common to building 
energy simulation program output. Some of these techniques 
include: (i) brushing and linking where the user can investi- 
gate the behavior during a few days of the year, (ii) tessel- 
lating a 2-D chart into multiple smaller 2-D charts giving a 
4-D view of the data such that a single value of a representa- 
tive sensor can be evenly divided into smaller spatial plots 
arranged by time of day, (iii) magic lenses which can zoom 
into a certain portion of the room, and (iv) magic brushes. 
These techniques enable rapid inspection of trends and sin- 
gularities which cannot be gleaned from conventional view- 
ing methods. 



3.5.3 Graphical Treatment of Outliers 

No matter how carefully an experiment is designed and performed,
there always exists the possibility of serious errors.
These errors could be due to momentary instrument malfunction
(say dirt sticking onto a paddle-wheel of a flow meter),
power surges (which may cause data logging errors), or the
engineering system deviating from its intended operation
due to random disturbances. Usually, it is difficult to pinpoint
the cause of the anomalies. The experimenter is often
not fully sure whether the outlier is anomalous, or whether
it is a valid or legitimate data point which does not conform
to what the experimenter "thinks" it should. In such cases,
throwing out a data point may amount to data "tampering" or
fudging of results. Usually, data which exhibit such anomalous
tendency are a minority. Even then, if the data analyst
retains these questionable observations, they can bias the
results of the entire analysis since they exert an undue influence
and can dominate a computed relationship between two
variables.

Fig. 3.26 Example of a three-dimensional plot of measured
hourly electricity use in a commercial building over nine
months. (From ASHRAE 2002 © American Society of Heating,
Refrigerating and Air-conditioning Engineers, Inc., www.ashrae.org)

Fig. 3.27 Contour plot characterizing the sensitivity of total power
consumption (condenser water pump power plus tower fan power) to
condenser water-loop controls for a single chiller load, ambient
wet-bulb temperature and chilled water supply temperature. (From Braun
et al. (1989) © American Society of Heating, Refrigerating and
Air-conditioning Engineers, Inc., www.ashrae.org)

Let us consider the case of outliers during regression for
the univariate case. Data points are said to be outliers when
their model residuals are large relative to the other points.
Instead of blindly using a statistical criterion, a better way
is to visually look at the data, and distinguish between end
points and center points. For example, point A of Fig. 3.30 is
quite obviously an outlier, and if the rejection criterion orders
its removal, one should proceed to do so. On the other hand,
point B, which is near the end of the data domain, may not be
a bad point at all, but merely the beginning of a new portion
of the curve (say, the onset of turbulence in an experiment
involving laminar flow). Similarly, even point C may be valid
and important. Hence, the only way to remove this ambiguity
is to take more observations at the lower end. Thus, a
modification of the statistical rejection criterion is that one
should reject points only if they are center points.

Fig. 3.28 Figure illustrating an overlay plot for shading
calculations. The sun-path diagram is generated by computing the solar
declination and azimuth angles for a given latitude (for 40° N)
during different times of the day and times of the year. The
"obstructions" from trees and objects are drawn over the graph
to yield important information of potential shading on the
collector (From Kreider et al. 2009 by permission of CRC Press)

Fig. 3.29 Scatter plot matrix or carpet plots for multivariable
graphical data analysis. The data corresponds to hourly climatic
data for Phoenix, AZ for January 1990. The bottom left hand
corner frame indicates how solar radiation in Btu/hr-ft² (x-axis)
varies with dry-bulb temperature (in °F) and is a flipped and
rotated image of that at the top right hand corner. The HR
variable represents humidity ratio (in lbm/lba). Points which fall
distinctively outside the general scatter can be flagged as outliers

Several advanced books present formal analytical treat- 
ment of outliers which allow diagnosing whether the regres- 
sor data set is ill-conditioned or not, as well as identifying 
and rejecting, if needed, the necessary outliers that cause 
ill-conditioning (for example, Belsley et al. 1980). Consider 
Fig. 3.31a. The outlier point will have little or no influence 



CO 



DC 




Regressor variable 

Fig. 3.30 Illustrating different types of outliers. Point A is very prob- 
ably a doubtful point; point B might be bad but could potentially be a 
very important point in terms of revealing unexpected behavior; point 
C is close enough to the general trend and should be retained until more 
data is collected 



on the regression parameters identified, and in fact retaining 
it would be beneficial since it would lead to a reduction in 
model parameter variance. The behavior shown in Fig. 3.31b
is more troublesome because the estimated slope is almost 
wholly determined by the extreme point. In fact, one may 
view this situation as a data set with only two data points, or 
one may view the single point as a spurious point and remove 
it from the analysis. Gathering more data at that range would 
be advisable, but may not be feasible; this is where the judg- 
ment of the analyst or prior information about the underlying 
trend line are useful. Whether a particular data point is an
influence point depends on how, and to what extent, it affects
the fitted regression line. This aspect is treated more formally
in Sect. 5.6.2.
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Fig. 3.31 Two other examples of outlier points. While the outlier point 
in (a) is most probably a valid point, this is not clear for the outlier point
in (b). Either more data should be collected or, failing that, it is advisable
to delete this point from any subsequent analysis. (From Belsley et al.
(1980) by permission of John Wiley and Sons) 
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3.6 Overall Measurement Uncertainty 

The International Organization for Standardization (ISO) and
six other organizations have published guides which have 
established the experimental uncertainty standard (an exam- 
ple is ANSI/ASME 1990). The following material is largely 
drawn from Guideline 2 (ASHRAE 2005) which deals with 
engineering analysis of experimental data. 



3.6.1 Need for Uncertainty Analysis 

Any measurement exhibits some difference between the mea- 
sured value and the true value and, therefore, has an associ- 
ated uncertainty. A statement of measured value without an 
accompanying uncertainty statement has limited meaning. 
Uncertainty is the interval around the measured value within 
which the true value is expected to fall with some stated confi- 
dence level. "Good data" does not describe data that yields the 
desired answer. It describes data that yields a result within an 
acceptable uncertainty interval or, in other words, provides the 
acceptable degree of confidence in the result. 

Measurements made in the field are especially sub- 
ject to potential errors. In contrast to measurements made 
under the controlled conditions of a laboratory setting, field 
measurements are typically made under less predictable 
circumstances and with less accurate and less expensive 
instrumentation. Furthermore, field measurements are vul- 
nerable to errors arising from: 

(a) Variable measurement conditions so that the method 
employed may not be the best choice for all conditions; 

(b) Limited instrument field calibration, because it is typi- 
cally more complex and expensive than laboratory 
calibration; 

(c) Simplified data sampling and archiving methods; and 

(d) Limitations in the ability to adjust instruments in the 
field. 

With appropriate care, many of these sources of error can 
be minimized: (i) through the systematic development of a 
procedure by which an uncertainty statement can be ascribed 
to the result, and (ii) through the optimization of the measure- 
ment system to provide maximum benefit for the least cost. 
The results of a practitioner who does not consider sources of 
error are likely to be questioned by others, especially since the 
engineering community is increasingly becoming sophisti- 
cated and mature about the proper reporting of measured data. 



3.6.2 Basic Uncertainty Concepts: Random 
and Bias Errors 



The latest ISO standard is described in Coleman and Steele
(1999) and involves treating bias and random errors in a
certain manner. However, the previous version is somewhat
simpler, and gives results which in many practical instances
are close enough. It is this version which is described
below (ANSI/ASME 1990). The bias and random errors
are both treated as random variables, but with different
confidence level multipliers applied to them as explained
below (while the latest ISO standard suggests a combined
multiplier).

(a) Bias or systematic error (or accuracy or fixed error) is
an unknown error that persists and is usually due to the
particular instrument or technique of measurement (see
Fig. 3.32). It is analogous to the sensor accuracy (see
Sect. 3.2.1). Statistics is of limited use in this case. The
best corrective action is to ascertain the extent of the bias
(say, by recalibration of the instruments) and to correct
the observations accordingly. Fixed (bias) errors are the
constant deviations that are typically the hardest to estimate
or document. They include such items as mis-calibration
as well as improper sensor placement. Biases are
essentially offsets from the true value that are constant
over time and do not change when the number of observations
is increased. For example, a bias is present if a
temperature sensor always reads 1 °C higher than the true
value from a certified calibration procedure. Note that the
magnitude of the bias is unknown for the specific situation,
and so measurements cannot simply be corrected.

(b) Random error (or precision error) is an error due
to the unpredictable and unknown variations in the
experiment that cause readings to take random values
on either side of some mean value. Measurements
may be precise or imprecise depending on how well
an instrument can reproduce the subsequent readings
of an unchanged input (see Fig. 3.32). Only random
errors can be treated by statistical methods. There are
two types of random errors: (i) additive errors that are
independent of the magnitude of the observations, and
(ii) multiplicative errors which are dependent on the
magnitude of the observations (Fig. 3.33). Usually
instrument accuracy is stated in terms of percent of
full scale, and so uncertainty of a reading is taken to
be additive, i.e., irrespective of the magnitude of the
reading.

Random errors are differences from one observation to the
next due to both sensor noise and extraneous conditions affecting
the sensor. The random error changes from one observation
to the next, but its mean (average value) over a very large
number of observations is taken to approach zero. Random
error generally has a well-defined probability distribution
that can be used to bound its variability in statistical terms as
described in the next two sub-sections when a finite number of
observations is made of the same variable.
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Fig. 3.32 Effect of measurement 
bias and precision errors 
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Fig. 3.33 Conceptual figures illustrating how additive and multiplica- 
tive errors affect the uncertainty bands around the trend line 

3.6.3 Random Uncertainty of a Measured 
Variable 

Based on measurements of a variable X, the true value of X
can be specified to lie in the interval $(X_{best} \pm U_x)$ where $X_{best}$
is usually the mean value of the measurements taken and $U_x$
is the uncertainty in X that corresponds to the estimate of the
effects of combining fixed and random errors.



The uncertainty being reported is specific to a confidence
level^. The confidence level defines the range of values or the
confidence limits (CL) that can be expected to include the 
true value with a stated probability. For example, a statement 
that the 95% CL are 5.1 to 8.2 implies that the true value will 
be contained between the interval bounded by 5.1 and 8.2 in 
19 out of 20 predictions (95% of the time), or that one is 95% 
confident that the true value lies between 5.1 and 8.2, or that 
there is a 95% probability that the actual value is contained 
in the interval {5.1, 8.2}. 

An uncertainty statement with a low confidence level is
usually of little use. For the previous example, if a
confidence level of 40% is used instead of 95%, the
interval becomes a tight 7.6 to 7.7. However, only 8 out of 20
predictions will likely lie between 7.6 and 7.7. Conversely,
it is useless to seek a 100% CL since then the true value of
some quantity would lie between plus and minus infinity.

Multi-sample data (repeated measurements of a fixed 
quantity using altered test conditions, such as different 
observers or different instrumentation or both) provides 
greater reliability and precision than single sample data 



^ Several publications cite uncertainty levels without specifying a cor- 
responding confidence level; such practice should be avoided. 
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(measurements by one person using a single instrument). 
For the majority of engineering cases, it is impractical and 
too costly to perform a true multi-sample experiment. While, 
strictly speaking, merely taking repeated readings with the 
same procedure and equipment does not provide multi- 
sample results, such a procedure is often accepted by the 
engineering community as a fair approximation of a multi- 
sample experiment. 

Depending upon the sample size of the data (greater or 
less than about 30 samples), different statistical consider- 
ations and equations apply. The issue of estimating confi- 
dence levels is further discussed in Chap. 4, but operational 
equations are presented below. These levels or limits are 
directly based on the Gaussian and the Student-t distribu- 
tions presented in Sect. 2.4.3a and b. 

(a) Random uncertainty in large samples (n > about 30):
The best estimate of a variable x is usually its mean
value, given by $\bar{x}$. The limits of the confidence interval
are determined from the sample standard deviation $s_x$.
The typical procedure is then to assume that the individual
data values are scattered about the mean following
a certain probability distribution function, within
($\pm z \cdot$ standard deviation $s_x$) of the mean. Usually a normal
probability curve (Gaussian distribution) is assumed
to represent the dispersion in experimental data, unless
the process is known to follow one of the standard
distributions (discussed in Sect. 2.4). For a normal
distribution, the standard deviation indicates the
degree of dispersion of the values about the mean (see
Table A3). For z = 1.96, 95% of the data will be within
($\pm 1.96\,s_x$) of the mean. Thus, the z multiplier has a
direct relationship with the confidence level selected
(assuming a known probability distribution). The
confidence limits (CL) for the mean of n multi-sample
random data, i.e., data which do not have any
fixed error, are:

$$x_{min} = \bar{x} - \frac{z\,s_x}{\sqrt{n}} \quad\text{and}\quad x_{max} = \bar{x} + \frac{z\,s_x}{\sqrt{n}} \qquad (3.16)$$

(b) Random uncertainty in small samples (n < about 30): In
many circumstances, the analyst will not be able to collect
a large number of data points, and may be limited to
a data set of less than 30 values (n < 30). Under such
conditions, the mean value and the standard deviation are
computed as before. The z value applicable for the normal
distribution cannot be used for small samples. The
new values, called t-values, are tabulated for different
degrees of freedom d.f. ($\nu = n - 1$) and for the acceptable
degree of confidence (see Table A4^). The confidence
interval for the mean value of x, when no fixed (bias)
errors are present in the measurements, is given by:

$$x_{min} = \bar{x} - \frac{t\,s_x}{\sqrt{n}} \quad\text{and}\quad x_{max} = \bar{x} + \frac{t\,s_x}{\sqrt{n}} \qquad (3.17)$$

^ Table A4 applies to critical values for one-tailed distributions, while
most of the discussion here applies to the two-tailed case. See Sect. 4.2.2
for the distinction between both.



For example, consider the case of d.f. = 10 and two-tailed
significance level α = 0.05. One finds from Table A4 that
t = 2.228 for 95% CL. Note that this decreases to t = 2.086
for d.f. = 20 and reaches the z value of 1.96 for d.f. = ∞.

Example 3.6.1 Estimating confidence intervals 

(a) The length of a field is measured 50 times. The mean is 30
with a standard deviation of 3. Determine the 95% CL.

This is a large sample case, for which the z
multiplier is 1.96. Hence, the 95% CL are

$$30 \pm \frac{(1.96)(3)}{(50)^{1/2}} = 30 \pm 0.83 = \{29.17,\ 30.83\}$$

(b) Only 21 measurements are taken and the same mean and
standard deviation as in (a) are found. Determine the
95% CL.

This is a small sample case for which the t-value = 2.086
for d.f. = 20. Then, the 95% CL will turn out to be wider:

$$30 \pm \frac{(2.086)(3)}{(21)^{1/2}} = 30 \pm 1.37 = \{28.63,\ 31.37\}$$ ■
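The two cases of Example 3.6.1 can be checked with a few lines of Python. This is a minimal sketch, not part of the original text: the function name `confidence_limits` is an assumption made for illustration, the z multiplier is obtained from the standard library's normal distribution, and the t multiplier is simply typed in from a Student-t table since the standard library has no t-distribution.

```python
import math
from statistics import NormalDist


def confidence_limits(mean, std_dev, n, multiplier):
    """Return (lower, upper) limits: mean -/+ multiplier * s / sqrt(n)."""
    half_width = multiplier * std_dev / math.sqrt(n)
    return mean - half_width, mean + half_width


# (a) Large sample: two-tailed 95% z multiplier from the standard normal.
z_95 = NormalDist().inv_cdf(0.975)  # approximately 1.96
limits_large = confidence_limits(30.0, 3.0, 50, z_95)

# (b) Small sample: t multiplier read from a Student-t table (d.f. = 20).
t_95_df20 = 2.086
limits_small = confidence_limits(30.0, 3.0, 21, t_95_df20)

print("n = 50:", limits_large)
print("n = 21:", limits_small)
```

The only difference between the two cases is the multiplier and the sample size; the smaller sample yields the wider interval.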



3.6.4 Bias Uncertainty 

Estimating the bias or fixed error at a specified confidence 
level (say, 95% confidence) is described below. The fixed 
error $B_x$ for a given value x is assumed to be a single value
drawn from some larger distribution of possible fixed errors.
The treatment is similar to that of random errors with the
major difference that only one value is considered even
though several observations may be taken. Lacking further
knowledge, a normal distribution is usually assumed. Hence,
if a manufacturer specifies that the fixed uncertainty $B_x$ is
±1.0°C with 95% confidence (compared to some standard
reference device), then one assumes that the fixed error belongs
to a larger distribution (taken to be Gaussian) with a standard
deviation $S_B$ = 0.5°C (since the corresponding z-value ≈ 2.0).



3.6.5 Overall Uncertainty 

The overall uncertainty of a measured variable x has to com- 
bine the random and bias uncertainty estimates. Though 
several forms of this expression appear in different texts, a 
convenient working formulation is as follows: 



$$U_x = \sqrt{B_x^2 + \left(\frac{t\,s_x}{\sqrt{n}}\right)^2} \qquad (3.18)$$
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where: 

$U_x$ = overall uncertainty in the value x at a specified
confidence level

$B_x$ = uncertainty in the bias or fixed component at the
specified confidence level

$s_x$ = standard deviation estimate for the random component

n = sample size

t = t-value at the specified confidence level for the
appropriate degrees of freedom

Example 3.6.2: For a single measurement, the statistical 
concept of standard deviation does not apply. Nonetheless,
one could estimate it from manufacturer's specifications if 
available. It is desired to estimate the overall uncertainty at 
95% confidence level in an individual measurement of water 
flow rate in a pipe under the following conditions: 

(a) full scale meter reading 150 L/s 

(b) actual flow reading 125 L/s 

(c) random error of instrument is ± 6% of full-scale reading 
at 95% CL 

(d) fixed (bias) error of instrument is ±4% of full-scale 
reading at 95% CL 

The solution is rather simple since all stated uncertainties
are at 95% CL. It is implicitly assumed that the
normal distribution applies. The random error = 150 ×
0.06 = ±9 L/s. The fixed error = 150 × 0.04 = ±6 L/s. The
overall uncertainty can be estimated from Eq. 3.18 with n = 1:

$$U_x = (6^2 + 9^2)^{1/2} = \pm 10.82 \text{ L/s}$$

The fractional overall uncertainty at 95% CL is

$$\frac{U_x}{x} = \frac{10.82}{125} = 0.087 = 8.7\%$$



Example 3.6.3: Consider Example 3.6.2. In an effort to 
reduce the overall uncertainty, 25 readings of the flow are 
taken instead of only one reading. The resulting uncertainty 
in this case is determined as follows. 

The bias error remains unchanged at ±6 L/s.

The random error decreases by a factor of $\sqrt{n}$ to
$9/(25)^{1/2}$ = ±1.8 L/s

The overall uncertainty is thus: $U_x = (6^2 + 1.8^2)^{1/2}$ = ±6.26 L/s

The fractional overall uncertainty at 95% confidence level is

$$\frac{U_x}{x} = \frac{6.26}{125} = 0.05 = 5.0\%$$

Increasing the number of readings from 1 to 25 reduces
the relative uncertainty in the flow measurement from ±8.7%
to ±5.0%. Because of the large fixed error, further increase
in the number of readings would result in only a small
reduction in the overall uncertainty. ■
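The arithmetic of Examples 3.6.2 and 3.6.3 can be reproduced with the short sketch below. It assumes, as the examples do, that the stated random error is already expressed at the 95% confidence level (so it stands in for the product of t and the standard deviation in Eq. 3.18); the function name `overall_uncertainty` is hypothetical.

```python
import math


def overall_uncertainty(bias, random_at_cl, n=1):
    """Combine bias and random uncertainty in quadrature (form of Eq. 3.18).

    random_at_cl is the random uncertainty of a single reading, already
    expressed at the chosen confidence level; it shrinks with sqrt(n).
    """
    return math.sqrt(bias ** 2 + (random_at_cl / math.sqrt(n)) ** 2)


full_scale = 150.0               # L/s, full-scale meter reading
reading = 125.0                  # L/s, actual flow reading
bias = 0.04 * full_scale         # +/- 6 L/s at 95% CL
random_err = 0.06 * full_scale   # +/- 9 L/s at 95% CL

u_single = overall_uncertainty(bias, random_err, n=1)
u_25 = overall_uncertainty(bias, random_err, n=25)

print(f"1 reading:   {u_single:.2f} L/s ({u_single / reading:.1%})")
print(f"25 readings: {u_25:.2f} L/s ({u_25 / reading:.1%})")
```

Letting n grow further in this sketch shows the overall uncertainty approaching the bias value of 6 L/s, which is the floor set by the fixed error.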

Example 3.6.4: A flow meter manufacturer stipulates a ran- 
dom error of 5% for his meter at 95.5% CL (i.e., at z = 2). 



Once installed, the engineer estimates that the bias error due 
to the placement of the meter in the flow circuit is 2% at 
95.5% CL. The flow meter takes a reading every minute, but 
only the mean value of 15 such measurements is recorded 
once every 15 min. Estimate the overall uncertainty at 99% 
CL of the mean of the recorded values. 

The bias uncertainty can be associated with the normal 
tables. From Table A3, z = 2.575 has an associated probabil- 
ity of 0.01 which corresponds to the 99% CL. Since 95.5% 
CL corresponds to z = 2, the bias uncertainty at one standard 
deviation =1%. 

Since the number of observations is less than 30, the
Student-t table has to be used for the random uncertainty
component. From Table A4, the critical t value for d.f. = 15 − 1 = 14
and significance level of 0.01 is equal to 2.977. Also, the
random uncertainty at one standard deviation = 5.0/2 = 2.5%.

Hence, the overall uncertainty of the recorded values at
99% CL is

$$U_x = \left\{[(2.575)(1)]^2 + \left[\frac{(2.977)(2.5)}{(15)^{1/2}}\right]^2\right\}^{1/2} \approx 3.2\% = 0.032$$



3.6.6 Chauvenet's Statistical Criterion of Data Rejection



The statistical considerations described above can lead to
analytical screening methods which can point out data errors
not flagged by graphical methods alone. Though several
types of rejection criteria can be formulated, perhaps the best
known is Chauvenet's criterion. This criterion, which
presumes that the errors are normally distributed and have
constant variance, specifies that any reading out of a series
of n readings shall be rejected if the magnitude of its
deviation $d_{max}$ from the mean value of the series is such that the
probability of occurrence of such a deviation is less than 1/(2n).
It is given by:

$$\frac{d_{max}}{s_x} = 0.819 + 0.544\,\ln(n) - 0.02346\,[\ln(n)]^2 \qquad (3.19)$$

where $s_x$ is the standard deviation of the series and n is the
number of data points. The deviation ratio for different numbers
of readings is given in Table 3.7. For example, if one
takes 10 observations, an observation shall be discarded if its
deviation from the mean is $d_{max} > (1.96)\,s_x$.

This data rejection should be done only once, and more 
than one round of elimination using the Chauvenet criterion 
is not advised. Note that the Chauvenet criterion has inherent 
assumptions which may not be justified. For example, the 
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Table 3.7 Table for Chauvenet' 
criterion of rejecting outliers following Eq. 3.19

Number of readings n    Deviation ratio d_max/s_x
2                       1.15
3                       1.38
4                       1.54
5                       1.65
6                       1.73
7                       1.80
10                      1.96
15                      2.13
20                      2.24
25                      2.33
30                      2.39
50                      2.57
100                     2.81
300                     3.14
500                     3.29
1000                    3.48


underlying distribution may not be normal, but could have 
a longer tail. In such a case, one may be throwing out good 
data. A more scientific manner of dealing with outliers which 
also yields similar results is to use weighted regression or 
robust regression, where observations farther away from the 
mean are given less weight than those from the center (see 
Sect. 5.6 and 9.6.1 respectively). 
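Chauvenet's criterion can be sketched in a few lines of Python. The sketch below is illustrative only: the function names and the data series are hypothetical, the exact deviation ratio is taken from the normal distribution as the value whose two-tailed probability equals 1/(2n), and the logarithmic fit assumes the squared-logarithm reading of Eq. 3.19 because that form reproduces the tabulated ratios.

```python
import math
from statistics import NormalDist, mean, stdev


def chauvenet_ratio(n):
    """Deviation ratio d_max/s_x whose two-tailed normal probability is 1/(2n)."""
    return NormalDist().inv_cdf(1.0 - 1.0 / (4.0 * n))


def chauvenet_ratio_fit(n):
    """Approximate ratio from the fit in ln(n) (assumed form of Eq. 3.19)."""
    ln_n = math.log(n)
    return 0.819 + 0.544 * ln_n - 0.02346 * ln_n ** 2


def chauvenet_reject(readings):
    """Return the readings flagged for rejection in a single pass."""
    m = mean(readings)
    s = stdev(readings)
    limit = chauvenet_ratio(len(readings)) * s
    return [x for x in readings if abs(x - m) > limit]


# Hypothetical series of 10 readings with one suspicious value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.1, 10.0, 13.5]
flagged = chauvenet_reject(data)

for n in (10, 50, 100):
    print(n, round(chauvenet_ratio(n), 2), round(chauvenet_ratio_fit(n), 2))
print("flagged:", flagged)
```

In keeping with the advice above, the function makes only one pass over the data; it is not meant to be applied repeatedly to the reduced series.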



3.7 Propagation of Errors 

In many cases, the variable of interest is not directly mea- 
sured, but values of several associated variables are mea- 
sured, which are then combined using a data reduction 
equation to obtain the value of the desired result. The objec- 
tive of this section is to present the methodology to estimate 
overall uncertainty from knowledge of the uncertainties in 
the individual variables. The random and fixed components, 
which together constitute the overall uncertainty, have to be 
estimated separately. The treatment that follows, though lim- 
ited to random errors, could also apply to bias errors. 



3.7.1 Taylor Series Method for Cross-Sectional 
Data 

In general, the standard deviation of a function $y = y(x_1, x_2, \ldots, x_n)$,
whose independently measured variables are all
given with the same confidence level, is obtained by the first
order expansion of the Taylor series:

$$s_y = \left[\sum_{i=1}^{n}\left(\frac{\partial y}{\partial x_i}\,s_{x,i}\right)^2\right]^{1/2} \qquad (3.20)$$

where:

$s_y$ = function standard deviation

$s_{x,i}$ = standard deviation of the measured quantity $x_i$.

Neglecting terms higher than the first order (as implied by a
first order Taylor Series expansion), the propagation
equations for some of the basic operations are given below. Let $x_1$
and $x_2$ have standard deviations $s_{x1}$ and $s_{x2}$. Then:



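Addition or subtraction: $y = x_1 \pm x_2$ and

$$s_y = \left(s_{x1}^2 + s_{x2}^2\right)^{1/2} \qquad (3.21)$$

Multiplication: $y = x_1 \cdot x_2$ and

$$s_y = (x_1 \cdot x_2)\left[\left(\frac{s_{x1}}{x_1}\right)^2 + \left(\frac{s_{x2}}{x_2}\right)^2\right]^{1/2} \qquad (3.22)$$

Division: $y = x_1/x_2$ and

$$s_y = \left(\frac{x_1}{x_2}\right)\left[\left(\frac{s_{x1}}{x_1}\right)^2 + \left(\frac{s_{x2}}{x_2}\right)^2\right]^{1/2} \qquad (3.23)$$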



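For multiplication and division, the fractional error is given
by the same expression. If $y = \frac{x_1 x_2}{x_3}$, then the fractional
standard deviation is:

$$\frac{s_y}{y} = \left(\frac{s_{x1}^2}{x_1^2} + \frac{s_{x2}^2}{x_2^2} + \frac{s_{x3}^2}{x_3^2}\right)^{1/2} \qquad (3.24)$$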



The uncertainty in the result depends on the squares of the 
uncertainties in the independent variables. This means that if 
the uncertainty in one variable is larger than the uncertainties 
in the other variables, then it is the largest uncertainty that 
dominates. To illustrate, suppose there are three variables
with an uncertainty of magnitude 1 and one variable with
an uncertainty of magnitude 5. The uncertainty in the result
would be $(5^2 + 1^2 + 1^2 + 1^2)^{1/2} = (28)^{1/2} = 5.29$. Clearly, the effect
of the largest uncertainty dominates the others.

An analysis involving relative magnitude of uncertainties 
plays an important role during the design of an experiment 
and the procurement of instrumentation. Very little is gained 
by trying to reduce the "small" uncertainties since it is the 
"large" ones that dominate. Any improvement in the over- 
all experimental result must be achieved by improving the 
instrumentation or experimental technique connected with 
these relatively large uncertainties. This concept is illustrated 
in Example 3.7.2 below. 

Equation 3.20 applies when the measured variables are 
uncorrelated. If they are correlated, their interdependence 
can be quantified by the covariance (defined by Eq. 3.9). 
If two variables $x_1$ and $x_2$ are correlated, then the standard
deviation of their sum is given by:
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Table 3.8 Error table of the four 
quantities that define the Reyn- 
olds number (Example 3.7.2) 



Quantity               Minimum     Maximum     Random error at       % errors*
                       flow        flow        full flow (95% CL)    Minimum   Maximum
Velocity m/s (V)       1           20          0.1                   10        0.5
Pipe diameter m (d)    0.2         0.2         0                     0         0
Density kg/m³ (ρ)      1000        1000        1                     0.1       0.1
Viscosity kg/m-s (μ)   1.12×10⁻³   1.12×10⁻³   0.45×10⁻⁵             0.4       0.4

* Note that the last two columns under "% errors" are computed from the previous three columns of data



$$s_y = \left[s_{x1}^2 + s_{x2}^2 + 2\,\mathrm{cov}(x_1, x_2)\right]^{1/2} \qquad (3.25)$$
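The effect of correlation on the standard deviation of a sum can be verified numerically. The sketch below uses a small hypothetical data set in which the two variables rise together; the helper `sample_cov` is written out by hand so that the example depends only on the Python standard library. It compares the variance of the sum computed directly with the value obtained by adding twice the covariance to the two individual variances, and with the value that would be obtained if the correlation were ignored.

```python
from statistics import mean, variance


def sample_cov(a, b):
    """Sample covariance with the (n - 1) denominator."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


# Hypothetical paired observations; x2 is strongly correlated with x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
total = [a + b for a, b in zip(x1, x2)]

var_direct = variance(total)
var_with_cov = variance(x1) + variance(x2) + 2.0 * sample_cov(x1, x2)
var_uncorrelated = variance(x1) + variance(x2)

print("direct:             ", var_direct ** 0.5)
print("with covariance:    ", var_with_cov ** 0.5)
print("ignoring covariance:", var_uncorrelated ** 0.5)
```

For positively correlated variables, ignoring the covariance term understates the standard deviation of the sum.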



Another method of dealing with propagation of errors is to
adopt a perturbation approach. To simplify this computation,
a computer routine can be written to perform the task
of calculating uncertainties approximately. One method
is based on approximating partial derivatives by a central
finite-difference approach. If $y = y(x_1, x_2, \ldots, x_n)$, then:

$$\frac{\partial y}{\partial x_1} \approx \frac{y(x_1 + \Delta x_1, x_2, \ldots) - y(x_1 - \Delta x_1, x_2, \ldots)}{2\,\Delta x_1}$$

$$\frac{\partial y}{\partial x_2} \approx \frac{y(x_1, x_2 + \Delta x_2, \ldots) - y(x_1, x_2 - \Delta x_2, \ldots)}{2\,\Delta x_2} \quad \text{etc.} \qquad (3.26)$$



No strict rules for the size of the perturbation or step size Δx
can be framed since they would depend on the underlying 
shape of the function. Perturbations in the range of 1-4% 
of the value are reasonable choices, and one should evalu- 
ate the stability of the partial derivative computed numeri- 
cally by repeating the calculations for a few different step 
sizes. In cases involving complex experiments with extended 
debugging phases, one should update the uncertainty analy- 
sis whenever a change is made in the data reduction pro- 
gram. Commercial software programs are also available with 
in-built uncertainty propagation formulae. This procedure is 
illustrated in Example 3.7.4 below. 
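A computer routine of the kind described above might look like the following minimal sketch. It is not taken from any particular software package: the function `propagate`, its `rel_step` argument and the numerical values are assumptions made for illustration, and the inputs are taken to be uncorrelated and nonzero. The result for a simple product is compared with the closed-form expression for multiplication.

```python
import math


def propagate(func, values, std_devs, rel_step=0.01):
    """First-order standard deviation of func(*values).

    Partial derivatives are approximated by central finite differences,
    perturbing one input at a time by rel_step times its value.
    Inputs are assumed to be uncorrelated and nonzero.
    """
    total = 0.0
    for i, (x, s) in enumerate(zip(values, std_devs)):
        dx = rel_step * abs(x)
        up = list(values)
        down = list(values)
        up[i] = x + dx
        down[i] = x - dx
        dy_dx = (func(*up) - func(*down)) / (2.0 * dx)
        total += (dy_dx * s) ** 2
    return math.sqrt(total)


def product(x1, x2):
    return x1 * x2


values = [10.0, 4.0]    # hypothetical measured values
std_devs = [0.5, 0.1]   # hypothetical standard deviations

s_numeric = propagate(product, values, std_devs)
s_closed = (values[0] * values[1]) * math.sqrt(
    (std_devs[0] / values[0]) ** 2 + (std_devs[1] / values[1]) ** 2
)
# Repeat with a different step size to check the stability of the derivatives.
s_numeric_2pct = propagate(product, values, std_devs, rel_step=0.02)

print(s_numeric, s_closed, s_numeric_2pct)
```

For a product the central difference is exact, so all three numbers agree; for more strongly nonlinear functions the comparison between step sizes is the practical stability check.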

Example 3.7.1: Uncertainty in overall heat transfer 
coefficient 

The equation of the overall heat-transfer coefficient U of
a heat exchanger consisting of a fluid flowing inside and
another fluid flowing outside a steel pipe of negligible
thermal resistance is $U = (1/h_1 + 1/h_2)^{-1} = h_1 h_2/(h_1 + h_2)$ where
$h_1$ and $h_2$ are the individual coefficients of the two fluids. If
$h_1$ = 15 W/m²·°C with a fractional error of 5% at 95% CL and
$h_2$ = 20 W/m²·°C with a fractional error of 3%, also at 95%
CL, what will be the fractional error in random uncertainty
of the U coefficient at 95% CL assuming bias error to be
zero?

In order to use the propagation of error equation, the par- 
tial derivatives need to be computed. One could proceed to 
do so analytically using basic calculus. Then: 



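$$\frac{\partial U}{\partial h_1} = \frac{h_2(h_1 + h_2) - h_1 h_2}{(h_1 + h_2)^2} = \frac{h_2^2}{(h_1 + h_2)^2} \qquad (3.27a)$$

and

$$\frac{\partial U}{\partial h_2} = \frac{h_1(h_1 + h_2) - h_1 h_2}{(h_1 + h_2)^2} = \frac{h_1^2}{(h_1 + h_2)^2} \qquad (3.27b)$$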



The expression for the fractional uncertainty in the overall 
heat transfer coefficient U is: 



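$$\left(\frac{S_U}{U}\right)^2 = \frac{h_2^2}{(h_1 + h_2)^2}\left(\frac{S_{h1}}{h_1}\right)^2 + \frac{h_1^2}{(h_1 + h_2)^2}\left(\frac{S_{h2}}{h_2}\right)^2 \qquad (3.28)$$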



Plugging in numerical values, one gets U = 8.571, while the
partial derivatives given by Eqs. 3.27 are computed as:

$$\frac{\partial U}{\partial h_1} = 0.3265 \quad\text{and}\quad \frac{\partial U}{\partial h_2} = 0.1837$$

The two terms on the right hand side of Eq. 3.28 provide
insight into the relative contributions of $h_1$ and $h_2$. These are
estimated as 83.16% and 16.84% respectively, indicating that
the former is the dominant one.

Finally, $S_U$ = 0.2686, yielding a fractional error ($S_U/U$) =
3.1% at 95% CL ■
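The numbers in Example 3.7.1 can be reproduced with the short script below, which uses the analytical partial derivatives. The variable names are chosen for illustration only; the share of each squared term in the total indicates which of the two coefficients contributes most to the uncertainty in U.

```python
import math

h1, h2 = 15.0, 20.0        # individual coefficients, W/m2-C
s_h1 = 0.05 * h1           # 5% fractional error at 95% CL
s_h2 = 0.03 * h2           # 3% fractional error at 95% CL

U = h1 * h2 / (h1 + h2)
dU_dh1 = h2 ** 2 / (h1 + h2) ** 2
dU_dh2 = h1 ** 2 / (h1 + h2) ** 2

# Squared contribution of each coefficient to the variance of U.
term_h1 = (dU_dh1 * s_h1) ** 2
term_h2 = (dU_dh2 * s_h2) ** 2
s_U = math.sqrt(term_h1 + term_h2)

share_h1 = term_h1 / (term_h1 + term_h2)
share_h2 = term_h2 / (term_h1 + term_h2)

print(f"U = {U:.3f}, s_U = {s_U:.4f}, s_U/U = {s_U / U:.1%}")
print(f"share of h1 = {share_h1:.2%}, share of h2 = {share_h2:.2%}")
```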

Example 3.7.2': Relative error in Reynolds number of flow 
in a pipe 

Water is flowing in a pipe at a certain measured rate. The 
temperature of the water is measured and the viscosity and 
density are then found from tables of water properties. Deter- 
mine the probable errors of the Reynolds numbers (Re) at the 
low and high flow conditions given the following informa- 
tion (Table 3.8): 

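Recall that $Re = \frac{\rho V d}{\mu}$. From Eq. 3.24, at minimum flow
condition, the relative error in Re is:

$$\frac{\Delta Re}{Re} = \left[\left(\frac{0.1}{1}\right)^2 + \left(\frac{1}{1000}\right)^2 + \left(\frac{0.45 \times 10^{-5}}{1.12 \times 10^{-3}}\right)^2\right]^{1/2} = (0.1^2 + 0.001^2 + 0.004^2)^{1/2} = 0.1 \text{ or } 10\%$$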



' Adapted from Schenck (1969) by permission of McGraw-Hill.
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0) 
[Figure: relative error plotted against Reynolds number (Re)]

Fig. 3.34 Expected variation in experimental relative error with magni- 
tude of Reynolds number (Example 3.7.2) 



(to within 4 decimal points) — note that there is no error in 
pipe diameter value. At maximum flow condition, the per- 
centage error is: 



$$\frac{\Delta Re}{Re} = (0.005^2 + 0.001^2 + 0.004^2)^{1/2} = 0.0065 \text{ or } 0.65\%$$



The above example reveals that (i) at low flow conditions 
the error is 10% which reduces to 0.65% at high flow con- 
ditions, and (ii) at low flow conditions the other sources of 
error are absolutely dwarfed by the 10% error due to flow
measurement uncertainty. Thus, the only way to improve 
the experiment is to improve flow measurement accuracy. If 
the experiment is run without changes, one can confidently 
expect the data at the low flow end to show a broad scat- 
ter becoming smaller as the velocity is increased. This phe- 
nomenon is captured by the confidence intervals shown in 
Fig. 3.34. ■ 
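The trend described in Example 3.7.2 can be traced numerically with the sketch below. It assumes a viscosity error of 0.45 × 10⁻⁵ kg/m-s, which is the value consistent with the 0.4% relative error used in the example, and takes the pipe diameter to be free of error; the names are illustrative.

```python
import math


def relative_error(fractional_errors):
    """Root-sum-square of fractional errors (product/quotient rule)."""
    return math.sqrt(sum(f ** 2 for f in fractional_errors))


rho, s_rho = 1000.0, 1.0        # density and its random error, kg/m3
mu, s_mu = 1.12e-3, 0.45e-5     # viscosity and its assumed error, kg/m-s
s_V = 0.1                       # velocity error, m/s (constant over the range)
d = 0.2                         # pipe diameter, m (no error assumed)

results = {}
for V in (1.0, 2.0, 5.0, 10.0, 20.0):
    Re = rho * V * d / mu
    results[V] = relative_error([s_V / V, s_rho / rho, s_mu / mu])
    print(f"V = {V:5.1f} m/s  Re = {Re:10.0f}  relative error = {results[V]:.2%}")
```

Because the velocity error is a fixed absolute amount, its fractional contribution falls as the velocity rises, which is why the scatter narrows toward the high flow end.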

Example 3.7.3: Selecting instrumentation during the exper- 
imental design phase 

An experimental program is being considered involving con- 
tinuous monitoring of a large chiller under field conditions. 
The objective of the monitoring is to determine the chiller 
Coefficient of Performance (COP) on an hourly basis. The 
fractional uncertainty in the COP should not be greater than 
5% at 95% CL. The rated full load is 450 tons of cooling 
(1 ton= 12,000 BTU/h). The chiller is operated under con- 
stant chilled water and condenser water flow rates. Only ran- 
dom errors are to be considered. 



The COP of a chiller is defined as the ratio of the amount
of cooling at the evaporator ($Q_{ch}$) to the electric power (E)
consumed:

$$COP = \frac{Q_{ch}}{E} \qquad (3.29)$$



While power E can be measured directly, the amount of cooling
$Q_{ch}$ has to be determined by individual measurements
of the chilled water volumetric flow rate and the difference
between the supply and return chilled water temperatures
along with water properties:

$$Q_{ch} = \rho\,V\,c\,\Delta T \qquad (3.30)$$

where: 

ρ = density of water,

V = chilled water volumetric flow rate, assumed constant
during operation (= 1080 gpm),

c = specific heat of water,

ΔT = temperature difference between the entering and leaving
chilled water at the evaporator (which changes during
operation)

The fractional uncertainty in COP (neglecting the small 

effect of uncertainties in the density and specific heat) is: 



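$$\frac{U_{COP}}{COP} = \left[\left(\frac{U_V}{V}\right)^2 + \left(\frac{U_E}{E}\right)^2 + \left(\frac{U_{\Delta T}}{\Delta T}\right)^2\right]^{1/2} \qquad (3.31)$$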



Note that since this is a preliminary uncertainty analysis, 
only random (precision) errors are considered. 
1. Let us assume that the maximum flow reading of the
selected meter is 1500 gpm and has 4% uncertainty
at 95% CL. This leads to an absolute uncertainty of
(1500 × 0.04) = 60 gpm. The first term $U_V/V$ is a constant
and does not depend on the chiller load since the flow
through the evaporator is maintained constant. The rated
chiller flow rate is 1080 gpm. Thus

$$\left(\frac{U_V}{V}\right)^2 = \left(\frac{60}{1080}\right)^2 = 0.0031 \quad\text{and}\quad \frac{U_V}{V} = \pm 0.056.$$



2. Assume that for the power measurement, the instrument
error at 95% CL is 4.0 kW, calculated as 1% of the
instrument full scale value of 400 kW. The chiller rated
capacity is 450 tons of cooling, with an assumed realistic lower
bound of 0.8 kW per ton of cooling. The anticipated
electric draw at full load of the chiller = 0.8 × 450 = 360 kW.
The fractional uncertainty at full load is then:

$$\left(\frac{U_E}{E}\right)^2 = \left(\frac{4.0}{360}\right)^2 = 0.00012 \quad\text{and}\quad \frac{U_E}{E} = \pm 0.011$$

Thus, the fractional uncertainty in the power is about five
times less than that of the flow rate.
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3. The random (precision) error at 95% CL for the type of 
commercial grade sensor to be used for temperature
measurement is 0.2°F. Consequently, the error in the
measurement of temperature difference ΔT = $(0.2^2 + 0.2^2)^{1/2}$ =
0.28°F. From manufacturer catalogs, the temperature
difference between supply and return chilled water
temperatures at full load can be assumed to be 10°F. The
fractional uncertainty at full load is then

$$\left(\frac{U_{\Delta T}}{\Delta T}\right)^2 = \left(\frac{0.28}{10}\right)^2 = 0.00078 \quad\text{and}\quad \frac{U_{\Delta T}}{\Delta T} = \pm 0.028$$



4. Propagation of the above errors yields the fractional 
uncertainty at 95% CL at full chiller load of the measured 
COP: 



$$\frac{U_{COP}}{COP} = (0.0031 + 0.00012 + 0.00078)^{1/2} = 0.063 = 6.3\%$$



It is clear that the fractional uncertainty of the proposed instru- 
mentation is not satisfactory for the intended purpose. 

The logical remedy is to select a more accurate flow meter 
or one with a lower maximum flow reading. ■ 
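The instrument selection exercise of Example 3.7.3 lends itself to a small script, sketched below under the same assumptions as the example (random errors only, full-load conditions, negligible uncertainty in density and specific heat). The function name and the final calculation of the largest admissible flow uncertainty are illustrative additions, not part of the original example.

```python
import math


def cop_fractional_uncertainty(u_flow, flow, u_power, power, u_dt, dt):
    """Root-sum-square of the three fractional uncertainties in COP."""
    return math.sqrt((u_flow / flow) ** 2 + (u_power / power) ** 2 + (u_dt / dt) ** 2)


flow = 1080.0                             # gpm, rated chilled water flow
u_flow = 0.04 * 1500.0                    # gpm, 4% of a 1500 gpm full scale
power = 0.8 * 450.0                       # kW, anticipated draw at full load
u_power = 0.01 * 400.0                    # kW, 1% of a 400 kW full scale
dt = 10.0                                 # F, full-load temperature difference
u_dt = math.sqrt(0.2 ** 2 + 0.2 ** 2)     # F, two sensors at 0.2 F each

base = cop_fractional_uncertainty(u_flow, flow, u_power, power, u_dt, dt)

# Largest flow uncertainty that would still meet the 5% target,
# keeping the power and temperature instrumentation unchanged.
target = 0.05
remaining = target ** 2 - (u_power / power) ** 2 - (u_dt / dt) ** 2
max_u_flow = flow * math.sqrt(remaining)

print(f"fractional uncertainty in COP = {base:.1%}")
print(f"flow uncertainty needed for 5% target <= {max_u_flow:.1f} gpm")
```

Under these assumptions the flow meter uncertainty would have to drop from 60 gpm to roughly 43 gpm, which is why a more accurate meter or one with a lower full-scale reading is the logical remedy.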

Example 3.7.4: Uncertainty in exponential growth models
Exponential growth models are used to model several commonly
encountered phenomena, from population growth to consumption of
resources. The amount of resource consumed over time Q(t) can be
modeled as:

$$Q(t) = \int_0^t P_0 e^{rt}\,dt = \frac{P_0}{r}\left(e^{rt} - 1\right)$$ (3.32a)

where $P_0$ = initial consumption rate, and r = exponential rate
of growth.

The world coal consumption in 1986 was equal to 5.0 billion (short)
tons and the recoverable reserves of coal were estimated at
1000 billion tons.

(a) If the growth rate is assumed to be 2.7% per year, how many
years will it take for the total coal reserves to be depleted?

Rearranging Eq. 3.32a results in

$$t = \frac{1}{r}\ln\left(1 + \frac{Q \cdot r}{P_0}\right)$$ (3.32b)

Or

$$t = \frac{1}{0.027}\ln\left[1 + \frac{(1000)(0.027)}{5}\right] = 68.75 \text{ years}$$

(b) Assume that the growth rate r and the recoverable reserves are
subject to random uncertainty. If the uncertainties of both
quantities are taken to be normal with one standard deviation
values of 0.2% (absolute) and 10% (relative) respectively,
determine the lower and upper estimates of the years to depletion
at the 95% confidence level.

Though the partial derivatives can be derived analytically, the use
of Eq. 3.26 will be illustrated so as to compute them numerically.
Let us use Eq. 3.32b with a perturbation multiplier of 1% to both
the base values of r (= 0.027) and of Q (= 1000). The pertinent
results are assembled in Table 3.9.

Table 3.9 Numerical computation of the partial derivatives of t with
respect to Q and r

              Assuming Q = 1000             Assuming r = 0.027
Multiplier    r         t (from Eq. 3.32b)  Q       t (from Eq. 3.32b)
0.99          0.02673   69.12924            990     68.43795
1.00          0.027     68.75178            1000    68.75178
1.01          0.02727   68.37917            1010    69.06297

From here:

$$\frac{\partial t}{\partial r} = \frac{(68.37917 - 69.12924)}{(0.02727 - 0.02673)} = -1389 \quad \text{and} \quad \frac{\partial t}{\partial Q} = \frac{(69.06297 - 68.43795)}{(1010 - 990)} = 0.03125$$

Then:

$$s_t = \left[\left(\frac{\partial t}{\partial r} s_r\right)^2 + \left(\frac{\partial t}{\partial Q} s_Q\right)^2\right]^{1/2} = \{[(-1389)(0.002)]^2 + [(0.03125)(0.1)(1000)]^2\}^{1/2} = (2.778^2 + 3.125^2)^{1/2} = 4.181$$

Thus, the lower and upper limits at the 95% CL (with z = 1.96) are

$$t = 68.75 \pm (1.96)(4.181) = \{60.55, 76.94\} \text{ years}$$

The analyst should repeat the above procedure with, say, a
perturbation multiplier of 2% in order to evaluate the stability
of the numerically derived partial derivatives. If these differ
substantially, it is urged that the function be plotted and
scrutinized for irregular behavior around the point of
interest. ■
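
The numerical procedure of Example 3.7.4 is easy to script, which also makes the suggested stability check with a 2% multiplier nearly effortless. The sketch below is one possible Python rendering; the function names (`years_to_depletion`, `central_difference`, `propagate`) are hypothetical and chosen only for this illustration, while the numerical inputs are those of the example.

```python
import math

P0 = 5.0  # initial consumption rate, billion tons per year (Example 3.7.4)


def years_to_depletion(Q, r, P0=P0):
    """Eq. 3.32b: time for cumulative consumption to reach the reserves Q."""
    return math.log(1.0 + Q * r / P0) / r


def central_difference(func, x, multiplier):
    """Slope obtained by perturbing x by -/+ (multiplier * x)."""
    h = multiplier * x
    return (func(x + h) - func(x - h)) / (2.0 * h)


Q, r = 1000.0, 0.027
s_r = 0.002       # one standard deviation of r (absolute)
s_Q = 0.10 * Q    # one standard deviation of Q (10% relative)


def propagate(multiplier):
    """Return (dt/dr, dt/dQ, s_t) for a given perturbation multiplier."""
    dt_dr = central_difference(lambda rr: years_to_depletion(Q, rr), r, multiplier)
    dt_dQ = central_difference(lambda qq: years_to_depletion(qq, r), Q, multiplier)
    s_t = math.hypot(dt_dr * s_r, dt_dQ * s_Q)
    return dt_dr, dt_dQ, s_t


t0 = years_to_depletion(Q, r)
for multiplier in (0.01, 0.02):
    dt_dr, dt_dQ, s_t = propagate(multiplier)
    low, high = t0 - 1.96 * s_t, t0 + 1.96 * s_t
    print(f"multiplier={multiplier:.2f}: dt/dr={dt_dr:.0f}, dt/dQ={dt_dQ:.5f}, "
          f"s_t={s_t:.3f}, 95% limits=({low:.2f}, {high:.2f})")
```

If the two multipliers give nearly the same derivatives, as they should for this smooth function, the first order estimate can be taken as numerically stable.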



3.7.2 Taylor Series Method for Time Series Data 

Fig. 3.35 Schematic layout of a cool storage system with the chiller
located upstream of the storage tanks for Example 3.7.5. (From Dorgan
and Elleson 1994 © American Society of Heating, Refrigerating and
Air-conditioning Engineers, Inc., www.ashrae.org)

Uncertainty in time series data differs from that of stationary
data in two regards:

(a) the uncertainty in the dependent variable $y_t$ at a given
time t depends on the uncertainty at the previous time
$y_{t-1}$, and thus, uncertainty compounds over consecutive
time steps, i.e., over time; and

(b) some or all of the independent variables x may be
cross-correlated, i.e., they have a tendency to either increase
or decrease in unison.

The effect of both these factors is to increase the uncertainty
as compared to stationary data (i.e., data without time-wise
behavior). Consider the function shown below:

$$y = f(x_1, x_2, x_3)$$ (3.33)

Following Eq. 3.25, the equation for the propagation of ran- 
dom errors for a data reduction function with variables that 
exhibit cross correlation (case (b) above) is given by: 

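$$U_y^2 = (U_{x_1} \cdot SC_{x_1})^2 + (U_{x_2} \cdot SC_{x_2})^2 + (U_{x_3} \cdot SC_{x_3})^2 + 2 r_{x_1 x_2} SC_{x_1} SC_{x_2} U_{x_1} U_{x_2} + 2 r_{x_1 x_3} SC_{x_1} SC_{x_3} U_{x_1} U_{x_3} + 2 r_{x_2 x_3} SC_{x_2} SC_{x_3} U_{x_2} U_{x_3}$$ (3.34)

where
$U_{x_i}$ is the uncertainty of variable $x_i$,
$SC_{x_i}$ = the sensitivity coefficient of y to variable $x_i$ = $\partial y / \partial x_i$, and
$r_{x_i x_j}$ = correlation coefficient between variables $x_i$ and $x_j$.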
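
To see how the cross-correlation terms of Eq. 3.34 inflate the uncertainty, the expression can be coded for any number of variables. The sketch below is a generic Python illustration; the function name `propagate_correlated`, the way correlations are passed (a dictionary keyed by index pairs) and the numerical inputs are all assumptions made for this example, not values from the text.

```python
import math
from itertools import combinations


def propagate_correlated(uncertainties, sensitivities, correlations=None):
    """Uncertainty in y following the form of Eq. 3.34.

    uncertainties -- list of U_x values, one per variable
    sensitivities -- list of SC_x = dy/dx values, one per variable
    correlations  -- dict {(i, j): r_ij} with i < j; missing pairs mean r = 0
    """
    correlations = correlations or {}
    variance = sum((u * sc) ** 2 for u, sc in zip(uncertainties, sensitivities))
    for i, j in combinations(range(len(uncertainties)), 2):
        r_ij = correlations.get((i, j), 0.0)
        variance += (2.0 * r_ij * sensitivities[i] * sensitivities[j]
                     * uncertainties[i] * uncertainties[j])
    return math.sqrt(variance)


# Hypothetical case: y = x1 + x2 + x3, so every sensitivity coefficient is 1.
U = [1.0, 2.0, 2.0]
SC = [1.0, 1.0, 1.0]

u_uncorrelated = propagate_correlated(U, SC)
u_fully_correlated = propagate_correlated(
    U, SC, {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0})

print(f"uncorrelated:     U_y = {u_uncorrelated:.2f}")
print(f"fully correlated: U_y = {u_fully_correlated:.2f}")
```

For this simple sum, perfectly correlated inputs make the uncertainties add linearly rather than in quadrature, which is the upper bound of the effect described in case (b).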

Example 3.7.5: Temporal Propagation of Uncertainty in 
ice storage inventory 

The concept of propagation of errors can be illustrated with
time-wise data for an ice storage system. Figure 3.35 is a
schematic of a typical cooling system comprising an upstream
chiller charging a series of ice tanks. The flow to these tanks
can be modulated by means of a three-way valve when partial
charging or discharging is to be achieved. The building loads
loop also has its dedicated pump and three-way valve. It is
common practice to charge and discharge the tanks uniformly.
Thus, they can be considered to be one large consolidated tank
for analysis purposes. The inventory of the tank is the cooling
capacity available at any given time, and is an important
quantity for the system operator to know because it would
dictate the operation of the chiller and how much to either
charge or discharge the storage at any given time. Unfortunately,
the direct measurement of this state is difficult. Sensors can
be embedded inside the tanks, but this measurement is usually
unreliable. Hence, it is more common for analysts to use the heat
balance method to deduce the state of charge. An energy balance
on the tank yields:



$$\frac{dQ}{dt} = q_{in} - q_{loss}$$ (3.35)

where
Q = stored energy amount or inventory of the storage system
(say, in kWh or Ton-hours)
t = time
$q_{in}$ = rate of energy flow into (or out of) the tank due to the
secondary coolant loop during charging (or discharging)
$q_{loss}$ = rate of heat lost by tank to surroundings

The rate of energy flow into or out of the tank can be deduced
by measurements from:

$$q_{in} = V \rho c_p (T_{in} - T_{out})$$ (3.36)

where
V = volumetric flow rate of the secondary coolant
$\rho$ = density of the coolant
$c_p$ = specific heat of coolant
$T_{out}$ = exit temperature of coolant from tank
$T_{in}$ = inlet temperature of coolant to tank

The two temperatures and the flow rate can be measured, and
thereby $q_{in}$ can be deduced.

The rate of heat loss from the tank to the surroundings can 
also be calculated as: 



$$q_{loss} = UA (T_s - T_{amb})$$ (3.37)

where
UA = effective overall heat loss coefficient of tank
$T_{amb}$ = ambient temperature
$T_s$ = average storage temperature

The UA value can be determined from the physical construction
of the tank and the ambient temperature measured.
Combining all three above equations:

$$\frac{dQ}{dt} = V \rho c_p (T_{in} - T_{out}) - UA (T_s - T_{amb})$$ (3.38)



Expressing the time rate change of heat transfer in terms of
finite differences results in an expression for stored energy at
time (t) with respect to time (t-1):

$$\Delta Q = Q_t - Q_{t-1} = \Delta t \cdot [C \cdot \Delta T - UA \cdot (T_s - T_{amb})]$$ (3.39a)

where
$\Delta t$ = time step at which observations are made (say, 1 h).
Table 3.10 Storage inventory and uncertainty propagation table for Example 3.7.5
Column legend: Hour = hour ending (t); Mode = mode of storage;
ΔQ_t = change in storage (kWh); Q_t = storage capacity (kWh);
T_in, T_out = inlet and exit fluid temperatures (°C); V = total flow
rate (L/s). The last four columns give the 95% CL uncertainty in
storage capacity: ΔU² = change in uncertainty U²_{Q,t} - U²_{Q,t-1}
(Eq. 3.40c); U_Q = U_{Q,t} (kWh); Abs. = absolute uncertainty
U_{Q,t}/Q_max; Rel. = relative uncertainty U_{Q,t}/Q_t.

Hour  Mode         ΔQ_t   Q_t    T_in   T_out  V       ΔU²       U_Q      Abs.    Rel.
8     Idle         -      2967   -      -      -       0.00      0.00     0.000   0.000
9     Idle         -      2967   -      -      -       0.00      0.00     0.000   0.000
10    Discharging  -183   2784   4.9    0.1    9.08    1345.72   36.68    0.012   0.013
11    Discharging  -190   2594   5.5    0.2    8.54    1630.99   54.56    0.018   0.021
12    Discharging  -327   2266   6.9    0.9    12.98   2116.58   71.37    0.024   0.031
13    Discharging  -411   1855   7.8    2.1    17.17   1960.29   83.99    0.028   0.045
14    Discharging  -461   1393   8.3    3.1    21.11   1701.69   93.57    0.032   0.067
15    Discharging  -443   950    8.1    3.4    22.44   1439.15   100.97   0.034   0.106
16    Discharging  -260   689    6.2    1.7    13.76   1223.73   106.86   0.036   0.155
17    Discharging  -165   524    5.3    1.8    11.22   744.32    110.28   0.037   0.210
18    Idle         -      524    -      -      -       0.00      110.28   0.037   0.210
19    Idle         -      524    -      -      -       0.00      110.28   0.037   0.210
20    Idle         -      524    -      -      -       0.00      110.28   0.037   0.210
21    Idle         -      524    -      -      -       0.00      110.28   0.037   0.210
22    Idle         -      524    -      -      -       0.00      110.28   0.037   0.210
23    Charging     265    847    -3.3   -0.1   19.72   721.59    113.51   0.038   0.134
24    Charging     265    1112   -3.4   -0.2   19.72   721.59    116.64   0.039   0.105
1     Charging     265    1377   -3.4   -0.2   19.72   721.59    119.70   0.040   0.087
2     Charging     265    1642   -3.6   -0.3   19.12   750.61    122.79   0.041   0.075
3     Charging     265    1907   -3.6   -0.4   19.72   721.59    125.70   0.042   0.066
4     Charging     265    2172   -3.8   -0.6   19.72   721.59    128.53   0.043   0.059
5     Charging     265    2437   -4     -0.8   19.72   721.59    131.31   0.044   0.054
6     Charging     265    2702   -4.4   -1.1   19.12   750.61    134.14   0.045   0.050
7     Charging     265    2967   -4.8   -1.6   19.72   721.59    136.80   0.046   0.046


$\Delta T$ = temperature difference between inlet and outlet fluid
streams, and
C = heat capacity rate of the fluid = $V \rho c_p$.

So as to simplify this example, the small effect of heat losses
is neglected (in practice, it is small but not negligible). Then
Eq. 3.39a reduces to:

$$\Delta Q = Q_t - Q_{t-1} = \Delta t \cdot C \cdot \Delta T$$ (3.39b)

Thus, knowing the state of charge $Q_{t-1}$, where (t-1) could
be the start of the operational cycle when the storage is fully
charged, one can keep track of the state of charge over the
day by repeating the calculation at hourly time steps.
Unfortunately, the uncertainty of the inventory compounds because
of the time series nature of how the calculations are made.
Hence, determining this temporal uncertainty is a critical
aspect.

Since the uncertainties in the property values for density and
specific heat of commonly used coolants are much smaller than the
other terms, the effect of their uncertainty can be neglected.
Therefore, the following equation can be used to calculate the
random error propagation of time-wise data results for this
example.

$$U_{Q,t}^2 - U_{Q,t-1}^2 = \Delta t \cdot [(U_C \cdot \Delta T)^2 + (C \cdot U_{\Delta T})^2 + 2 r_{C,\Delta T} \cdot C \cdot \Delta T \cdot U_C \cdot U_{\Delta T}]$$ (3.40a)

where C is the heat capacity rate of the fluid which changes
hourly.

Assuming further that variables C and $\Delta T$ are uncorrelated,
Eq. 3.40a reduces to:

$$U_{Q,t}^2 - U_{Q,t-1}^2 = \Delta t \cdot [(U_C \cdot \Delta T)^2 + (C \cdot U_{\Delta T})^2]$$ (3.40b)

If needed, a similar expression can be used for the fixed
error. Finally, the quadratic sum of both uncertainties would
yield the total uncertainty.

Table 3.10 assembles hourly results of an example structured
similarly to one from a design guide (Dorgan and Elleson 1994).
This corresponds to the hour by hour performance of a storage
system such as that shown in Fig. 3.35. The storage is fully
charged at the end of 7:00 am where the daily cycle is assumed
to start. The status of the storage inventory is indicated as
either charging/discharging/idle, while the amount of heat flow
in or out and the running inventory capacity of the tank are
shown in columns 3 and 4. The two inlet and outlet temperatures
and the fluid flow through the tank are also indicated. These
are the operational variables of the system. Table 3.11 gives
numerical values of the pertinent variables and their
uncertainty values which are used to compute the last four
columns of the table.

Table 3.11 Magnitude and associated uncertainty of various quantities
used in Example 3.7.5

Quantity                 Symbol  Value        Random uncertainty at 95% CL
Density of water         ρ       1000 kg/m³   0.0
Specific heat of water   c_p     4.2 kJ/kg°C  0.0
Temperature              T       °C           0.1°C
Flow rate                V       L/s          U_V = 6% of full scale reading of
                                              30 L/s = 1.8 L/s = 6.48 m³/hr
Temperature difference   ΔT      °C           0.141°C (value used in Eq. 3.40c)

The uncertainty at 95% CL in the heat capacity rate of the fluid
flowing into the storage is:

$$U_C = \rho c_p U_V = (1000)(4.2)(6.48) = 27{,}216 \text{ kJ/hr-°C} = 7.56 \text{ kWh/hr-°C}$$

Inserting numerical values in Eq. 3.40b and setting the time
step as one hour, one gets (in kWh²)

$$U_{Q,t}^2 - U_{Q,t-1}^2 = [(7.56) \cdot \Delta T]^2 + [C \cdot (0.141)]^2$$ (3.40c)

The uncertainty at the start of the calculation of the storage
inventory is taken to be 0% while the maximum storage capacity
$Q_{max}$ = 2967 kWh. Equation 3.40c is used at each time step,
and the time evolution of the uncertainty is shown in the last
two columns both as a fraction of the maximum storage capacity
(referred to as "absolute", i.e., $U_{Q,t}/Q_{max}$) and as a
relative uncertainty, i.e., as $U_{Q,t}/Q_t$. The variation of
both these quantities is depicted graphically in Fig. 3.36.
Note that the absolute uncertainty at 95% CL increases to 4.6%
during the course of the day, while the relative uncertainty
goes up to 21% during the hours of the day when the storage is
essentially depleted. Further, note that various simplifying
assumptions have been made during the above analysis; a detailed
evaluation can be quite complex, and so, whenever possible,
simplifications should be made depending on the specific system
behavior and the accuracy to which the analysis is being done. ■
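
The hour-by-hour accumulation of Eq. 3.40c lends itself to a short loop. The Python sketch below reproduces the first three discharging hours of Table 3.10 under the same simplifications as the example (no heat losses, uncorrelated C and ΔT, one-hour time step). The variable names are illustrative, and the conversion of flow in L/s to a heat capacity rate assumes water at 1 kg/L, consistent with Table 3.11.

```python
import math

U_C = 7.56       # kWh/hr-°C, uncertainty in heat capacity rate (from the text)
U_DT = 0.141     # °C, uncertainty in temperature difference (Eq. 3.40c)
RHO_CP = 4.2     # kJ/(L·°C) for water, i.e., 1 kg/L times 4.2 kJ/kg°C
Q_MAX = 2967.0   # kWh, maximum storage capacity

# (hour, inlet temp °C, exit temp °C, flow L/s, storage capacity Q_t kWh)
# First three discharging hours taken from Table 3.10.
hours = [
    (10, 4.9, 0.1, 9.08, 2784.0),
    (11, 5.5, 0.2, 8.54, 2594.0),
    (12, 6.9, 0.9, 12.98, 2266.0),
]

variance = 0.0   # U_Q^2 is zero at the start of the daily cycle
results = []
for hour, t_in, t_out, flow, q_t in hours:
    delta_t = t_in - t_out
    c_rate = flow * RHO_CP   # kJ/(s·°C) = kW/°C = kWh/hr-°C
    # Eq. 3.40c: the hourly increment adds to the running variance.
    variance += (U_C * delta_t) ** 2 + (c_rate * U_DT) ** 2
    u_q = math.sqrt(variance)
    results.append((hour, u_q, u_q / Q_MAX, u_q / q_t))

for hour, u_q, absolute, relative in results:
    print(f"hour {hour}: U_Q = {u_q:6.2f} kWh, "
          f"absolute = {absolute:.3f}, relative = {relative:.3f}")
```

Because the variance only ever grows, the uncertainty never decreases during idle hours; it is the shrinking denominator Q_t that drives the relative uncertainty up as the storage empties.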



3.7.3 Monte Carlo Method 

The previous method of ascertaining uncertainty, namely that
based on the first order Taylor series expansion, is widely
used, but it has limitations. If uncertainty is large, this method
may be inaccurate for non-linear functions since it assumes
derivatives based on local functional behavior. Further, an
implicit assumption is that errors are normally distributed.
Finally, in many cases, deriving partial derivatives of com- 
plex analytical functions is a tedious and error-prone affair, 
and even the numerical approach described and illustrated 
above is limited to cases of small uncertainties. A more 
general manner of dealing with uncertainty propagation is 
to use Monte Carlo methods, though these are better suited 
for more complex situations (and treated at more length in 
Sects. 11.2.3 and 12.2.7). These methods are numerical 
methods for solving problems involving random numbers 
and require considerations of probability. Monte Carlo, in 
essence, is a process where the individual basic variables 
or inputs are sampled randomly from their prescribed prob- 
ability distributions so as to form one repetition (or run or 
trial). The corresponding numerical solution is one possible 
outcome of the function. This process of generating runs is 
repeated a large number of times resulting in a distribution 
of the functional values which can then be represented as 
probability distributions, or as histograms, or by summary 
statistics or by confidence intervals for any percentile thresh- 
old chosen. The last option is of great importance in cer- 
tain types of studies. The accuracy of the results improves 
with the number of runs in a square root manner. Increasing 
the number of runs 100 times will approximately reduce the 
uncertainty by a factor of 10. Thus, the process is computer 
intensive and requires thousands of runs be performed. How- 
ever, the entire process is simple and easily implemented on 
spreadsheet programs (which have inbuilt functions for gen- 
erating pseudo-random numbers of selected distributions). 
Specialized software programs are also available. 
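
The square root dependence on the number of runs can be observed directly. The Python sketch below is a generic illustration, not a procedure from the text: it estimates the mean of a standard normal variable many times, first with a small number of runs and then with 100 times as many, and compares the scatter of the two sets of estimates. The seed, the run counts and the function name are arbitrary choices made for this sketch.

```python
import random
import statistics


def spread_of_mean(n_runs, n_repeats, rng):
    """Scatter (standard deviation) of a Monte Carlo mean across repeats."""
    estimates = []
    for _ in range(n_repeats):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_runs)]
        estimates.append(statistics.fmean(sample))
    return statistics.stdev(estimates)


rng = random.Random(12345)   # arbitrary seed, used only for repeatability

small = spread_of_mean(n_runs=50, n_repeats=100, rng=rng)
large = spread_of_mean(n_runs=5000, n_repeats=100, rng=rng)
ratio = small / large

print(f"scatter with   50 runs: {small:.4f}")
print(f"scatter with 5000 runs: {large:.4f}")
print(f"ratio: {ratio:.1f} (a value near 10 is expected for 100 times more runs)")
```

The ratio itself is a random quantity, so it will only be close to 10, not exactly 10; repeating with other seeds gives a feel for that variability.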

There is a certain amount of uncertainty associated with 
the process because Monte Carlo simulation is a numerical 
method. Several authors propose approximate formulae for 
determining the number of trials, but a simple method is as 
follows. Start with a large number of trials (say, 1000), and 
generate pseudo random numbers with the assumed prob- 
ability distribution. Since they are pseudo-random, the mean 
and the distribution (say, the standard deviation) may devi- 
ate somewhat from the desired ones (which depend on the 
accuracy of the algorithm used). Generate a few such sets 
and pick one which is closest to the desired quantities. Use 
this set to simulate the corresponding values of the function. 
This can be repeated a few times till one finds that the mean 
and standard deviations stabilize around some average values
which are taken to be the answer. It is also urged that the
analyst evaluate how sensitive the results are to the number of
trials, say, by using 3000 trials and ascertaining that the
results of the 1000 trial and 3000 trial runs are similar.
If they are not, sets with an increasingly large number of trials
should be used till the results converge.
Fig. 3.36 Time variation of the absolute and relative uncertainties at
95% CL of the ice storage inventory for Example 3.7.5

The approach is best understood by means of a simple
example.

Example 3.7.6: Using Monte Carlo to determine uncertainty in
exponential growth models

Let us solve the problem given in Example 3.7.4 by the Monte Carlo
method. The approach involves setting up a spreadsheet table as
shown in Table 3.12. Since only two variables (namely Q and r) have
uncertainty, one needs only assign two columns to these and a third
column to the desired quantity, i.e., the time t over which the
total coal reserves will be depleted. The first row shows the
calculation using the mean values, and one sees that the value of
t = 68.75 found in part (a) of Example 3.7.4 is obtained (this is
done for verifying the cell formula). The analyst then generates
random numbers of Q and r with the corresponding mean and standard
deviations as specified and shown in the first row of the table.
Monte Carlo methods, being numerical methods, require that a large
sample be generated in order to obtain reliable results. In this
case, 1000 normal distribution samples were generated, and the first
few and last few rows are shown in Table 3.12. Even with 1000
samples, one finds that the sample mean and standard deviation
deviate somewhat from the desired ones because of the pseudo-random
nature of the random numbers generated by the spreadsheet program.
For example, instead of having (1000, 100) for the mean and standard
deviation of Q, the 1000 samples have (1005.0, 101.82). On the other
hand, the differences for r are negligible.

Table 3.12 The first few and last few calculations used to determine
uncertainty in variable t using the Monte Carlo method (Example 3.7.6)

Run #    Q (1000, 100)   r (0.027, 0.002)   t (years)
1        1000.0000       0.0270             68.7518
2        1050.8152       0.0287             72.2582
3        1171.6544       0.0269             73.6445
4        1098.2454       0.0284             73.2772
5        1047.5003       0.0261             69.0848
6        1058.0283       0.0247             67.7451
7        946.8644        0.0283             68.5256
8        1075.5269       0.0277             71.8072
9        967.9137        0.0278             68.6323
10       1194.7164       0.0262             73.3758
11       747.9499        0.0246             57.2155
12       1099.7061       0.0269             71.5707
13       1074.3923       0.0254             69.1221
14       1000.2640       0.0265             68.2233
15       1071.4876       0.0274             71.3437
...      ...             ...                ...
983      1004.2355       0.0282             70.1973
984      956.4792        0.0277             68.1372
985      1001.2967       0.0293             71.3534
986      1099.9830       0.0306             75.7549
987      1033.7338       0.0267             69.4667
988      934.5567        0.0279             67.6464
989      1055.7171       0.0282             71.8201
990      1133.6639       0.0278             73.6712
991      997.0123        0.0252             66.5173
992      896.6957        0.0257             63.8175
993      1056.2361       0.0283             71.9108
994      1033.8229       0.0298             72.8905
995      1078.6051       0.0295             73.9569
996      1137.8546       0.0276             73.4855
997      950.8749        0.0263             66.3670
998      1023.7800       0.0264             68.7452
999      950.2093        0.0248             64.5692
1000     849.0252        0.0247             61.0231
mean     1005.0          0.0272             68.91
stdev.   101.82          0.00199            3.919

The corresponding mean and standard deviation of t are found to be
(68.91, 3.919) compared to the previously estimated values of
(68.75, 4.181). This difference is not too large, but the
pseudo-random generation of the values for Q is rather poor and
ought to be improved. Thus, the analyst should repeat the Monte
Carlo simulation a few times with different seeds for the random
number generator; this is likely to result in more robust
estimates. ■
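
The spreadsheet procedure of Example 3.7.6 translates directly into a few lines of code, which makes it easy to repeat the simulation with different seeds as recommended. The following Python sketch is one way of doing so; the function names, the seeds and the choice of 3000 trials are assumptions made for this illustration, while the means and standard deviations of Q and r and the value of the initial consumption rate are those of Example 3.7.4. The numbers it produces will differ somewhat from those in Table 3.12 since a different random number generator is used.

```python
import math
import random
import statistics

P0 = 5.0   # initial consumption rate, billion tons per year (Example 3.7.4)


def years_to_depletion(Q, r):
    """Eq. 3.32b evaluated for one Monte Carlo trial."""
    return math.log(1.0 + Q * r / P0) / r


def monte_carlo(n_trials, seed):
    """Return the mean and standard deviation of t over n_trials runs."""
    rng = random.Random(seed)
    t_values = []
    for _ in range(n_trials):
        Q = rng.gauss(1000.0, 100.0)   # reserves: mean 1000, stdev 100
        r = rng.gauss(0.027, 0.002)    # growth rate: mean 0.027, stdev 0.002
        t_values.append(years_to_depletion(Q, r))
    return statistics.fmean(t_values), statistics.stdev(t_values)


# Repeat with a few arbitrary seeds to judge the robustness of the estimates.
outcomes = [monte_carlo(n_trials=3000, seed=seed) for seed in (1, 2, 3)]
for seed, (t_mean, t_stdev) in zip((1, 2, 3), outcomes):
    low, high = t_mean - 1.96 * t_stdev, t_mean + 1.96 * t_stdev
    print(f"seed {seed}: mean = {t_mean:.2f}, stdev = {t_stdev:.3f}, "
          f"95% limits = ({low:.2f}, {high:.2f})")
```

If the means and standard deviations obtained with different seeds agree closely with each other and with the Taylor series estimate, the number of trials can be considered adequate.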



3.8 Planning a Non-intrusive Field 
Experiment 

Any experiment should be well-planned involving several 
rational steps (for example, ascertaining that the right sensors 
and equipment are chosen, that the right data collection protocol
and scheme are followed, and that the appropriate data
analysis procedures are selected). It is advisable to explicitly 
adhere to the following steps (ASHRAE 2005): 

(a) Identify experimental goals and acceptable accuracy 
Identify realistic experimental goals (along with some 
measure of accuracy) that can be achieved within the 
time and budget available for the experiment. 

(b) Identify variables and relationships 

Identify the entire list of relevant measurable variables 
that should be examined. If some are inter-dependent, 
or if some are difficult to measure, find alternative 
variables. 

(c) Establish measured variables and limits 

For each measured variable, determine its theoretical 
limits and expected bounds to match the selected instru- 
ment limits. Also, determine instrument limits - all sen- 
sor and measurement instruments have physical limits 
that restrict their ability to accurately measure quanti- 
ties of interest. 

(d) Preliminary instrumentation selection 

Selection of the equipment should be based on the accuracy,
repeatability and features of the instrument, as well as cost.
Regardless of the instrument chosen, it should have been
calibrated within the last twelve months or within an interval
required by the manufacturer, whichever is less. The required
accuracy of the instrument will depend upon the acceptable
level of uncertainty for the experiment.

(e) Document uncertainty of each measured variable 
Utilizing information gathered from manufacturers or 
past experience with specific instrumentation, document 
the uncertainty for each measured variable. This infor- 
mation will then be used in estimating the overall uncer- 
tainty of results using propagation of error methods. 

(f) Perform preliminary uncertainty analysis 

An uncertainty analysis of proposed measurement 
procedures and experimental methodology should be 
completed before the procedures and methodology are 
finalized in order to estimate the uncertainty in the final 
results. The higher the accuracy required of measure- 
ments, the higher the accuracy of sensors needed to 
obtain the raw data. The uncertainty analysis is the basis 
for selection of a measurement system that provides 
acceptable uncertainty at least cost. How to perform 
such a preliminary uncertainty analysis was discussed in 
Sect. 3.6 and 3.7. 

(g) Final instrument selection and methods

Based on the results of the preliminary uncertainty
analysis, evaluate earlier selection of instrumentation.
Revise selection if necessary to achieve the acceptable
uncertainty in the experiment results.

(h) Install instrumentation

Instrumentation should be installed in accordance with
manufacturer's recommendations. Any deviation in the
installation from the manufacturer's recommendations
should be documented and the effects of the deviation
on instrument performance evaluated. A change in
instrumentation or location may be required if in-situ
uncertainty exceeds acceptable limits determined by the
preliminary uncertainty analysis.

(i) Perform initial data quality verification

To ensure that the measurements taken are not too
uncertain and represent reality, instrument calibration
and independent checks of the data are recommended.
Independent checks can include sensor validation,
energy balances, and material balances (see Sect. 3.3).

(j) Collect data

The challenge for data acquisition in any experiment 
is to collect the required amount of information while 
avoiding collection of superfluous information. Super- 
fluous information can overwhelm simple measures 
taken to follow the progress of an experiment and can 
complicate data analysis and report generation. The 
relationship between the desired result, either static, 
periodic stationary or transient, and time is the deter- 
mining factor for how much information is required. 
A static, non-changing result requires only the steady- 
state result and proof that all transients have died out. A 
periodic stationary result, the simplest dynamic result, 
requires information for one period and proof that the 
one selected is one of three consecutive periods with 
identical results within acceptable uncertainty. Tran- 
sient or non-repetitive results, whether a single pulse or 
a continuing, random result, require the most informa- 
tion. Regardless of the result, the dynamic characteris- 
tics of the measuring system and the full transient nature 
of the result must be documented for some relatively 
short interval of time. Identifying good models requires
a certain amount of diversity in the data, i.e., the data
should cover the spatial domain of variation of the
independent variables (discussed in Sect. 6.2). Some basic
suggestions pertinent to controlled experiments, which are
also pertinent for non-intrusive data collection, are
summarized below.

(i) Range of variability: The most obvious way in 
which an experimental plan can be made compact 
and efficient is to space the variables in a predeter- 
mined manner. If a functional relationship between 
an independent variable X and a dependent vari- 
able Y is sought, the most obvious way is to select 
end points or limits of the test, thus covering the 
test envelope or domain that encloses the complete 
family of data. For a model of the type Z=f(X,Y), 
a plane area or map is formed (see Fig. 3.37). 
Functions involving more variables are usually 
broken down to a series of maps. The above discussion
relates to controllable regressor variables.
Extraneous variables, by their very nature, cannot
be varied at will. An example is phenomena driven
by climatic variables: the energy use of a building
is affected by outdoor dry-bulb temperature, humidity
and solar radiation. Since these cannot be varied at
will, a proper experimental data collection plan would
entail collecting data during different seasons of the
year.

Fig. 3.37 A possible XYZ envelope with Z as the independent variable.
The dashed lines enclose the total family of points over the feasible
domain space

(ii) Grid spacing considerations: Once the domains or 
ranges of variation of the variables are defined, the 
next step is to select the grid spacing. Being able to 
anticipate the system behavior from theory or from 
prior publications would lead to a better experi- 
mental design. For a relationship between X and Y 
which is known to be linear, the optimal grid is to 
space the points at the two extremities. However, 
if a linear relationship between X and Y is sought 
for a phenomenon which can be approximated as 
linear, then it would be best to space the x points 
evenly. 

For non-linear or polynomial functions, an equally
spaced test sequence in X is clearly not optimal.
Consider the pressure drop through a new fitting as
a function of flow. It is known that the relationship
is quadratic. Choosing an experiment with equally
spaced X values would result in a plot such as that
shown in Fig. 3.38a. One would have more observations
in the low pressure drop region and fewer in the
higher range. One may argue that an optimal spacing
would be to select the velocity values such that the
pressure drop readings are more or less evenly spaced
(see Fig. 3.38b). Which of the two is better depends
on the instrument precision. If the pressure drop
instrument has constant relative precision during the
entire range of variation of the experiment, then test
spacing as shown in Fig. 3.38b is clearly better. But
if the fractional uncertainty of the instrument
decreases with increasing pressure drop values, then
the point spacing sequence shown in Fig. 3.38a is
better.
(k) Accomplish data reduction and analysis 

Data reduction involves the distillation of raw data into 
a form that is usable for further analysis. Data reduc- 
tion may involve averaging multiple measurements, 
quantifying necessary conditions (e.g., steady state), 
comparing with physical limits or expected ranges, and 
rejecting outlying measurements. 
(l) Perform final uncertainty analysis

A detailed final uncertainty analysis is done after the
entire experiment has been completed and when the
results of the experiments are to be documented or
reported. This will take into account unknown field
effects and variances in instrument accuracy during the
experiment. A final uncertainty analysis involves the
following steps: (i) estimate the fixed (bias) error based
upon instrumentation calibration results, and (ii) document
the random error due to the instrumentation based
upon instrumentation calibration results. As pointed out
by Coleman and Steele (1999), the fixed errors needed
for the detailed uncertainty analysis are usually more
difficult to estimate with a high degree of certainty.
Minimizing fixed errors can be accomplished by careful
calibration with referenced standards.

Fig. 3.38 Two different experimental designs for proper
identification of the parameter (k) appearing in the model for
pressure drop versus velocity of a fluid flowing through a pipe
assuming $\Delta p = kV^2$. The grid spacing shown in (a) is the more
common one based on equal increments in the regressor variable,
while that in (b) is likely to yield more robust estimation but
would require guess-estimating the range of variation for the
pressure drop

(m) Reporting results 

Reporting is the primary means of communicating the 
results from an experiment. The report should be struc- 
tured to clearly explain the goals of the experiment and 
the evidence gathered to achieve the goals. It is assumed 
that data reduction, data analysis and uncertainty analy- 
sis have processed all data to render them understand- 
able by the intended audiences. Different audiences 
require different reports with various levels of detail and 
background information. In any case, all reports should 
include the results of the uncertainty analysis to an iden- 
tified confidence level (typically 95%). Uncertainty 
limits can be given as either absolute or relative (in per- 
centages). Graphical and mathematical representations 
are often used. On graphs, error bars placed vertically 
and horizontally on representative points are a very clear 
way to present expected uncertainty. A data analysis sec- 
tion and a conclusion are critical sections, and should be 
prepared with great care while being succinct and clear. 
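
The two grid spacing strategies discussed under step (j) for the quadratic pressure drop model of Fig. 3.38 can be generated as follows. This is a minimal Python sketch; the function names, the velocity range of 0 to 4 (in arbitrary units) and the value k = 1 are hypothetical choices made only to show how the test points shift towards higher velocities when the pressure drop readings, rather than the velocities, are spaced evenly.

```python
import math


def equal_velocity_grid(v_min, v_max, n_points):
    """Test velocities at equal increments of the regressor (Fig. 3.38a)."""
    step = (v_max - v_min) / (n_points - 1)
    return [v_min + i * step for i in range(n_points)]


def equal_pressure_drop_grid(v_min, v_max, n_points, k=1.0):
    """Test velocities giving evenly spaced dp = k * V**2 (Fig. 3.38b)."""
    dp_min, dp_max = k * v_min ** 2, k * v_max ** 2
    step = (dp_max - dp_min) / (n_points - 1)
    return [math.sqrt((dp_min + i * step) / k) for i in range(n_points)]


# Hypothetical test envelope: velocities from 0 to 4 with five test points.
grid_a = equal_velocity_grid(0.0, 4.0, 5)
grid_b = equal_pressure_drop_grid(0.0, 4.0, 5)

print("Equal velocity spacing:")
for v in grid_a:
    print(f"  V = {v:.3f}, dp = {v ** 2:.3f}")

print("Equal pressure drop spacing:")
for v in grid_b:
    print(f"  V = {v:.3f}, dp = {v ** 2:.3f}")
```

Generating the second grid requires an advance guess of k, or at least of the range of pressure drops, which is the practical price noted in the caption of Fig. 3.38.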



Problems 

Pr. 3.1 Consider the data given in Table 3.2. Determine 

(a) the 10% trimmed mean value 

(b) which observations can be considered to be "mild" outliers
(> 1.5 × IQR)

(c) which observations can be considered to be "extreme"
outliers (> 3.0 × IQR)

(d) identify outliers using Chauvenet's criterion given by 
Eq. 3.19 

(e) compare the results from (b), (c) and (d). 

Pr. 3.2 Consider the data given in Table 3.6. Perform an 
exploratory data analysis involving computing pertinent sta- 
tistical summary measures, and generating pertinent graphi- 
cal plots. 

Pr. 3.3 A nuclear power facility produces a vast amount of 
heat which is usually discharged into the aquatic system. This 
heat raises the temperature of the aquatic system resulting in 
a greater concentration of chlorophyll which in turn extends 
the growing season. To study this effect, water samples were 
collected monthly at three stations for one year. Station A is 
located closest to the hot water discharge, and Station C the 
farthest (Table 3.13). 

You are asked to perform the following tasks and annotate 
with pertinent comments: 

(a) flag any outlier points 

(b) compute pertinent statistical descriptive measures 

(c) generate pertinent graphical plots 

(d) compute the correlation coefficients. 



Table 3.13 Data table for Problem 3.3 

Month        Station A    Station B    Station C 
January        9.867        3.723        4.410 
February      14.035        8.416       11.100 
March         10.700       20.723        4.470 
April         13.853        9.168        8.010 
May            7.067        4.778       34.080 
June          11.670        9.145        8.990 
July           7.357        8.463        3.350 
August         3.358        4.086        4.500 
September      4.210        4.233        6.830 
October        3.630        2.320        5.800 
November       2.953        3.843        3.480 
December       2.640        3.610        3.020 
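
Task (a) of Pr. 3.3 can be approached in several ways. The sketch below applies the IQR rule quoted in Pr. 3.1 (1.5 × IQR for "mild" and 3.0 × IQR for "extreme" outliers) to the Station C column only. It is a minimal illustration, not a full solution: the function name `iqr_outliers` is hypothetical, and the quartile convention (linear interpolation, `method="inclusive"`) is an assumption, since other conventions shift the fences slightly.

```python
import statistics

# Station C readings from Table 3.13 (January to December)
station_c = [4.410, 11.100, 4.470, 8.010, 34.080, 8.990,
             3.350, 4.500, 6.830, 5.800, 3.480, 3.020]


def iqr_outliers(values):
    """Split points lying beyond the quartiles into mild and extreme outliers."""
    # Assumed quartile convention: linear interpolation between order statistics.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for v in values:
        # Distance outside the [q1, q3] box; negative for points inside it.
        dist = max(q1 - v, v - q3)
        if dist > 3.0 * iqr:
            extreme.append(v)
        elif dist > 1.5 * iqr:
            mild.append(v)
    return mild, extreme


mild, extreme = iqr_outliers(station_c)
print("mild outliers:", mild)
print("extreme outliers:", extreme)
```

The same function can be called on the Station A and Station B columns; whether a flagged point should actually be discarded remains a matter of engineering judgment.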






Pr. 3.4 Consider Example 3.7.3 where the uncertainty analy- 
sis on chiller COP was done at full load conditions. What about 
part-load conditions, especially since there is no collected 
data? One could use data from chiller manufacturer catalogs 
for a similar type of chiller, or one could assume that part-load 
operation will affect the inlet minus the outlet chilled water 
temperature difference (ΔT) in a proportional manner, as 
stated below. 

(a) Compute the 95% CL uncertainty in the COP at 70% 
and 40% full load assuming the evaporator water flow 
rate to be constant. At part load, the evaporator 
temperature difference is reduced proportionately to the 
chiller load, while the electric power drawn is assumed to 
increase from a full load value of 0.8 kW/t to 1.0 kW/t 
at 70% full load and to 1.2 kW/t at 40% full load. 
Would the instrumentation be adequate, or would it be 
prudent to consider better instrumentation if the 
fractional COP uncertainty at 95% CL should be less than 
10%? 

Note that fixed (bias) errors have been omitted from 
the analysis, and some of the assumptions in predict- 
ing part-load chiller performance can be questioned. 
A similar exercise with slight variations in some of 
the assumptions, called a sensitivity study, would be 
prudent at this stage. How would you conduct such an 
investigation? 

Pr. 3.5 Consider the uncertainty in the heat transfer coef- 
ficient illustrated in Example 3.7.1. The example was solved 
analytically using the Taylor's series approach. You are asked 
to solve the same example using the Monte Carlo method: 

(a) using 500 data points 

(b) using 1000 data points 

Compare the results from this approach with those in the 
solved example. 

Pr. 3.6 You will repeat Example 3.7.6. Instead of computing 
the standard deviation, plot the distribution of the time vari- 
able t in order to evaluate its shape. Numerically determine 
the uncertainty bands for the 95% CL. 

Pr. 3.7 Determining cooling coil degradation based on 
effectiveness 

The thermal performance of a cooling coil can also be char- 
acterized by the concept of effectiveness widely used for 
thermal modeling of traditional heat exchangers. In such 
coils, a stream of humid air flows across a coil supplied by 
chilled water and is cooled and dehumidified as a result. In 
this case, the effectiveness can be determined as: 

$$\varepsilon = \frac{\text{actual heat transfer rate}}{\text{maximum possible heat transfer rate}} = \frac{h_{ai} - h_{ao}}{h_{ai} - h_{ci}} \qquad (3.41)$$

where $h_{ai}$ and $h_{ao}$ are the enthalpies of the air stream 
at the inlet and outlet respectively, and $h_{ci}$ is the 
enthalpy of entering chilled water. 

The effectiveness is independent of the operating condi- 
tions provided the mass flow rates of air and chilled water 
remain constant. An HVAC engineer would like to determine 
whether the coil has degraded after it has been in service for 
a few years. For this purpose he assembles the following coil 
performance data at identical air and water flow rates corre- 
sponding to when originally installed (done during start-up 
commissioning) and currently (Table 3.14). 

Note that the uncertainty in determining the air enthalpies 
is relatively large due to the uncertainty associated 
with measuring bulk air stream temperatures and humidities. 
However, the uncertainty in the enthalpy of the chilled water 
is only half of that of air. 

(a) Assess, at 95% CL, whether the cooling coil has degraded 
or not. Clearly state any assumptions you make during 
the evaluation. 

(b) What are the relative contributions of the uncertainties 
in the three enthalpy quantities to the uncertainty in the 
effectiveness value? Do these differ from the installed 
period to the time when current tests were performed? 

Pr. 3.8¹ Consider a basic indirect heat exchanger where the 
heat transfer rates associated with the cold and hot 
sides are given by: 

$$Q_{actual} = m_c \, c_{pc} \, (T_{c,o} - T_{c,i}) \quad \text{(cold side heating)}$$
$$Q_{actual} = m_h \, c_{ph} \, (T_{h,i} - T_{h,o}) \quad \text{(hot side cooling)} \qquad (3.42a)$$



Table 3.14 Data table for Problem 3.7 

Quantity                            Units    When installed   Current   95% Uncertainty 
Entering air enthalpy (h_ai)        Btu/lb        38.7          36.8         5% 
Leaving air enthalpy (h_ao)         Btu/lb        27.2          28.2         5% 
Entering water enthalpy (h_ci)      Btu/lb        23.2          21.5         2.5% 


Table 3.15 Parameters and uncertainties to be assumed (Pr. 3.8) 

Parameter   Nominal value    95% Uncertainty 
c_pc        1 Btu/lb°F       ±5% 
m_c         475,800 lb/h     ±10% 
T_c,i       34°F             ±1°F 
T_c,o       46°F             ±1°F 
c_ph        0.9 Btu/lb°F     ±5% 
m_h         450,000 lb/h     ±10% 
T_h,i       55°F             ±1°F 
T_h,o       40°F             ±1°F 


where m, T and c are the mass flow rate, temperature and 
specific heat respectively, while the subscripts o and i stand 
for outlet and inlet, and c and h denote cold and hot streams 
respectively. 

The effectiveness of the sensible heat exchanger is given 
by: 



$$\varepsilon = \frac{\text{actual heat transfer rate}}{\text{maximum possible heat transfer rate}} \qquad (3.42b)$$



Assuming the values and uncertainties of various parameters 
shown in Table 3.15: 

(i) compute the heat exchanger loads and the uncertainty 
ranges for the hot and cold sides 
(ii) compute the uncertainty in the effectiveness 
determination 
(iii) what would you conclude regarding the heat balance 
checks? 

Pr. 3.9 The following table (Table 3.16) (EIA 1999) indicates 
the total electricity generated by five different types of 
primary energy sources as well as the total emissions 
associated with each. Clearly, coal and oil generate a lot of 
emissions or pollutants which are harmful not only to the 
environment but also to public health. France, on the other 
hand, has a mix of 21% coal and 79% nuclear. 



Table 3.16 Data table for Problem 3.9: US power generation mix 
and associated pollutants (emissions in short tons; 1 short 
ton = 2000 lb) 

Fuel         Electricity   % Total   SO2         NOx         CO2 
             kWh (1999) 
Coal         1.77E+12       55.7     1.13E+07    6.55E+06    1.90E+09 
Oil          8.69E+10        2.7     6.70E+05    1.23E+05    9.18E+07 
Nat. Gas     2.96E+11        9.3     2.00E+03    3.76E+05    1.99E+08 
Nuclear      7.25E+11       22.8     0.00E+00    0.00E+00    0.00E+00 
Hydro/Wind   3.00E+11        9.4     0.00E+00    0.00E+00    0.00E+00 
Totals       3.18E+12      100.0     1.20E+07    7.05E+06    2.19E+09 

¹ From ASHRAE (2005) © American Society of Heating, Refrigerating 
and Air-conditioning Engineers, Inc., www.ashrae.org. 

Table 3.17 Data table for Problem 3.10 

Symbol   Description                              Value   95% Uncertainty 
HP       Horse power of the end use device        40      5% 
Hours    Number of operating hours in the year    6500    10% 
η_old    Efficiency of the old motor              0.85    4% 
η_new    Efficiency of the new motor              0.92    2% 






(a) Calculate the total and percentage reductions in the 
three pollutants should the U.S. change its power gen- 
eration mix to mimic that of France (Hint: First normal- 
ize the emissions per kWh for all three pollutants) 

(b) The generation mix percentages (coal, oil, natural gas, 
nuclear and hydro/wind) have an inherent uncertainty 
of 5% at the 95% CL, while the uncertainties of the 
three pollutants are 5, 8 and 3% respectively. Assum- 
ing normal distributions for all quantities, compute 
the uncertainty of the reduction values estimated in (a) 
above. 

Pr. 3.10 Uncertainty in savings from energy conservation 
retrofits 

There is great interest in implementing retrofit measures 
meant to conserve energy in individual devices as well as 
in buildings. These measures have to be justified 
economically, and including uncertainty in the estimated 
energy savings is an important element of the analysis. 
Consider the rather simple problem involving replacing an 
existing electric motor with a more energy efficient one. 
The annual energy savings $E_{save}$ in kWh/yr are given by: 



$$E_{save} = (0.746)\,(HP)\,(Hours)\left[\frac{1}{\eta_{old}} - \frac{1}{\eta_{new}}\right] \qquad (3.43)$$



with the symbols described in Table 3.17 along with their 
numerical values. 

(i) Determine the absolute and relative uncertainties in 
$E_{save}$ under these conditions. 
(ii) If this uncertainty had to be reduced, which variable 
will you target for further refinement? 
(iii) What is the minimum value of $\eta_{new}$ under which the 
lower bound of the 95% CL interval is greater than zero? 

Pr. 3.11 Uncertainty in estimating outdoor air fraction in 
HVAC systems 

Ducts in heating, ventilating and air-conditioning (HVAC) 
systems supply conditioned air (SA) to the various spaces 
in a building, and also exhaust the air from these spaces, 
called return air (RA). A sketch of an all-air HVAC system is 
shown in Fig. 3.39. Occupant comfort requires that a certain 
amount of outdoor air (OA) be brought into the HVAC system 
while an equal amount of return air is exhausted to the 
outdoors. The OA and the RA mix at a point just before the 
air-handler unit. Outdoor air ducts have dampers installed in 
order to control the OA since excess OA leads to unnecessary 
energy wastage. One of the causes for recent complaints 
from occupants has been identified as inadequate OA, and 
sensors installed inside the ducts could modulate the dampers 
accordingly. Flow measurement is always problematic on 
a continuous basis. Hence, OA flow is inferred from 
measurements of the air temperature $T_R$ inside the RA stream, 
$T_O$ inside the OA stream and $T_M$ inside the mixed air (MA) 
stream. The supply air flow is deduced by measuring the fan 
speed with a tachometer, using a differential pressure gauge 
to measure static pressure rise, and using the manufacturer's 
equation for the fan curve. The random error of the sensors is 
0.2°F at 95% CL with negligible bias error. 

Fig. 3.39 Sketch of an all-air HVAC system supplying conditioned air 
to indoor rooms of a building. Labels in the sketch: outdoor air 
(OA), mixed air (MA), return air (RA), air-handler unit, to 
building zones 

(a) From a sensible heat balance where changes in specific 
heat with temperature are neglected, derive the following 
expression for the outdoor air fraction (ratio of outdoor 
air and mixed air): 

$$OA_f = \frac{T_R - T_M}{T_R - T_O}$$

(b) Derive the expression for the uncertainty in $OA_f$ and 
calculate the 95% CL in $OA_f$ if $T_R$ = 70°F, $T_O$ = 90°F 
and $T_M$ = 75°F. 
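
One way to check a hand derivation for part (b) is to code the first-order (Taylor series) propagation of errors directly. The sketch below is an illustration under stated assumptions: the three temperature sensors are taken to have independent random errors of equal magnitude (0.2°F at 95% CL), and the function names are hypothetical.

```python
import math


def oa_fraction(t_r, t_o, t_m):
    """Outdoor air fraction from return, outdoor and mixed air temperatures."""
    return (t_r - t_m) / (t_r - t_o)


def oa_fraction_uncertainty(t_r, t_o, t_m, u_t):
    """First-order uncertainty in OA_f, assuming independent sensor errors u_t."""
    d = t_r - t_o
    d_tr = (t_m - t_o) / d**2   # partial derivative with respect to T_R
    d_to = (t_r - t_m) / d**2   # partial derivative with respect to T_O
    d_tm = -1.0 / d             # partial derivative with respect to T_M
    return u_t * math.sqrt(d_tr**2 + d_to**2 + d_tm**2)


oa_f = oa_fraction(70.0, 90.0, 75.0)
u_oa_f = oa_fraction_uncertainty(70.0, 90.0, 75.0, u_t=0.2)
print(f"OA_f = {oa_f:.3f} +/- {u_oa_f:.4f} (95% CL)")
```

Because the input error is already quoted at 95% CL, the propagated value is at the same confidence level. The mixed air temperature term is the largest contributor in this case.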

Pr. 3.12 Sensor placement in HVAC ducts with consider- 
ation of flow non-uniformity 

Consider the same situation as in Pr. 3.11. Usually, the air 
ducts have large cross-sections. The problem with inferring 
outdoor air flow using temperature measurements is the 
large thermal non-uniformity usually present in these ducts 
due to both stream separation and turbulence effects. More- 
over, temperature (and, hence density) differences between 
the OA and MA streams result in poor mixing. The following 
table gives the results of a traverse in the mixed air duct with 
9 measurements (using an equally spaced grid of 3 x 3 desig- 
nated by numbers in bold in Table 3.18). The measurements 
were replicated four times under the same outdoor condi- 
tions. The random error of the sensors is 0.2°F at 95% CL 
with negligible bias error. Determine: 
(a) the worst and best grid locations for placing a single 
sensor (to be determined based on analyzing the record- 
ings at each of the 9 grid locations and for all four time 
periods) 

Table 3.18 Table showing the temperature readings (in °F) at the nine 
different sections (S#1-S#9) of the mixed air (MA) duct (Pr. 3.12) 

S#1: 55.6, 54.6, 55.8, 54.2    S#2: 56.3, 58.5, 57.6, 63.8    S#3: 53.7, 50.2, 59.0, 49.4 
S#4: 58.0, 62.4, 62.3, 65.8    S#5: 66.4, 67.8, 68.7, 67.6    S#6: 61.2, 56.3, 64.7, 58.8 
S#7: 63.5, 65.0, 63.6, 64.8    S#8: 67.4, 61.4, 66.8, 65.7    S#9: 63.9, 61.4, 62.4, 60.6 



(b) the maximum and minimum errors at 95% CL one 
could expect in the average temperature across the duct 
cross-section, if the best grid location for the single sen- 
sor was adopted. 

Pr. 3.13 Uncertainty in estimated proportion of exposed 
subjects using Monte Carlo method 

Dose-response modeling is the process of characterizing 
the relation between the dose of an administered/exposed 
agent and the incidence of an adverse health effect. These 
relationships are subject to large uncertainty because of the 
paucity of data as well as the fact that they are extrapolated 
from laboratory animal tests. Haas (2002) suggested the use 
of an exponential model for mortality rate due to inhalation 
exposure by humans to anthrax spores (characterized by the 
number of colony forming units or cfu): 



$$p = 1 - \exp(-kd) \qquad (3.44)$$



where p is the expected proportion of exposed individu- 
als likely to die, d is the average dose (in cfu) and k is the 
dose response parameter (in units of 1/cfu). A value of 
k=0.26x 10"^ has been suggested. One would like to deter- 
mine the shape and magnitude of the uncertainty distribution 
of d at p = 0.5 assuming that the one standard deviation (or 
uncertainty) of k is 30% of the above value and is normally 
distributed. Use the Monte Carlo method with 1000 trials to 
solve this problem. Also, investigate the shape of the error 
probability distribution, and ascertain the upper and lower 
95% CL. 
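
A minimal Monte Carlo sketch of this kind of problem is given below. To avoid depending on the numerical value of k, it works with the ratio of the dose to its nominal value: from Eq. 3.44, d = -ln(1 - p)/k, so d/d_nominal = k_nominal/k for any fixed p. The function name, the seed, and the choice to redraw non-positive values of k (which a normal distribution with a 30% standard deviation can occasionally produce) are assumptions made for illustration.

```python
import random
import statistics


def simulate_relative_dose(n_trials=1000, rel_sd=0.30, seed=1):
    """Monte Carlo draws of d / d_nominal = k_nominal / k, with k normal."""
    rng = random.Random(seed)
    ratios = []
    while len(ratios) < n_trials:
        k_rel = rng.gauss(1.0, rel_sd)  # k expressed relative to its nominal value
        if k_rel <= 0.0:
            continue  # assumption: non-physical draws are discarded and redrawn
        ratios.append(1.0 / k_rel)
    return ratios


ratios = simulate_relative_dose()
cuts = statistics.quantiles(ratios, n=40)   # cut points every 2.5%
lower, upper = cuts[0], cuts[-1]            # 2.5th and 97.5th percentiles
median = statistics.median(ratios)
print(f"median ratio = {median:.2f}")
print(f"95% limits   = {lower:.2f} to {upper:.2f} (times the nominal dose)")
```

The limits are read directly from the simulated distribution rather than from a standard deviation, because the distribution of d is skewed to the right even though k is symmetric. A histogram of `ratios` shows the shape.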

Pr. 3.14 Uncertainty in the estimation of biological dose 
over time for an individual 

Consider an occupant inside a building in which an acciden- 
tal biological agent has been released. The dose (D) is the 
cumulative amount of the agent to which the human body 
is subjected, while the response is the measurable physio- 
logical change produced by the agent. The widely accepted 
approach for quantifying dose is to assume functional forms 
based on first-order kinetics. For biological and radiological 
agents where the process of harm being done is cumulative, 
one can use Haber's law (Heinsohn and Cimbala 2003): 



$$D(t) = k \int_{t_1}^{t_2} C(t)\,dt \qquad (3.45a)$$

where C(t) is the indoor concentration at a given time t, k is a 
constant which includes effects such as the occupant breathing 
rate, the absorption efficiency of the agent or species, etc., 
and $t_1$ and $t_2$ are the start and end times. This relationship 
is often used to determine health-related exposure guidelines 
for toxic substances. For a simple one-zone building, the free 
response, i.e., the temporal decay, is given in terms of the 
initial concentration $C(t_1)$ by: 

$$C(t) = C(t_1)\,\exp[-a(t - t_1)] \qquad (3.45b)$$

where the model parameter "a" is a function of the volume 
of the space and the outdoor and supply air flow rates. The 
above equation is easy to integrate during any time period 
from $t_1$ to $t_2$, thus providing a convenient means of computing 
total occupant inhaled dose when occupants enter or leave 
the contaminated zones at arbitrary times. Let a = 0.017186 
with 11.7% uncertainty while $C(t_1)$ = 7000 cfu/m³ (cfu: colony 
forming units). Assume k = 1. 
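
For reference, the integration mentioned above can be written out explicitly. This is a sketch of the closed form obtained by substituting the exponential decay (Eq. 3.45b) into Haber's law (Eq. 3.45a), assuming that k and a are constant over the interval:

```latex
\begin{aligned}
D(t_1, t_2) &= k \int_{t_1}^{t_2} C(t_1)\, e^{-a (t - t_1)}\, dt \\
            &= \frac{k\, C(t_1)}{a} \left[ 1 - e^{-a (t_2 - t_1)} \right]
\end{aligned}
```

As a check, for small a(t_2 - t_1) the bracket tends to a(t_2 - t_1), so the dose reduces to k C(t_1)(t_2 - t_1), the value for a constant concentration.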

(a) Determine the total dose to which the individual is 
exposed to at the end of 15 min. 

(b) Compute the uncertainty of the total dose at 1 min time 
intervals over 15 min (similar to the approach in Exam- 
ple 3.7.6) 

(c) Plot the 95% CL over 15 min at 1 min intervals 

Pr. 3.15 Propagation of optical and tracking errors in solar 
concentrators 

Solar concentrators are optical devices meant to increase the 
incident solar radiation flux density (power per unit area) on 
a receiver. Separating the solar collection component (viz., 
the reflector) and the receiver can allow heat losses per col- 
lection area to be reduced. This would result in higher fluid 
operating temperatures at the receiver. However, there are 
several sources of errors which lead to optical losses: 
(i) Due to non-specular or diffuse reflection from the 
reflector, which could be due to improper curvature 
of the reflector surface during manufacture (shown in 
Fig. 3.40a) or to progressive dust accumulation over the 
surface over time as the system operates in the field; 
(ii) Due to tracking errors arising from improper tracking 
mechanisms as a result of improper alignment sensors or 
non-uniformity in drive mechanisms (usually, the track- 
ing is not continuous; a sensor activates a motor every 
few minutes which re-aligns the reflector to the solar 
radiation as it moves in the sky). The result is a spread 
in the reflected radiation as illustrated in Fig. 3.40b; 
(iii) Improper reflector and receiver alignment during the 
initial mounting of the structure or due to small ground/ 
pedestal settling over time. 
The above errors are characterized by root mean square 
(or rms) random errors (bias errors such as those arising from 
structural mismatch can often be corrected by one-time or 
regular corrections), and their combined effect can be 
determined statistically following the basic propagation of 
errors formula. Note that these errors need not be normally 
distributed, but such an assumption is often made in practice. 
Thus, rms values representing the standard deviations of these 
errors are used for such types of analysis. 

Fig. 3.40 Different types of optical and tracking errors. a Micro- 
roughness in solar concentrator surface leads to a spread in the reflected 
radiation. The roughness is illustrated as a dotted line for the ideal 
reflector surface and as a solid line for the actual surface. b Tracking 
errors lead to a spread in incoming solar radiation shown as a normal 
distribution. Note that a tracker error of $\sigma_{track}$ results in a 
reflection error $\sigma = 2\sigma_{track}$ from Snell's law. Factor of 2 
also pertains to other sources based on the error occurring as light both 
enters and leaves the optical device (see Eq. 3.46) 

The finite angular size of the solar disc results in incident 
solar rays that are not parallel but subtend an angle of about 
33 min or 9.6 mrad. 

(a) You will analyze the absolute and relative effects of this 
source of radiation spread at the receiver considering 
various other optical errors described above, and using 
the numerical values shown in Table 3.19. 

$$\sigma_{totalspread} = \left[ (\sigma_{solardisk})^2 + (2\sigma_{manuf})^2 + (2\sigma_{dustbuild})^2 + (2\sigma_{sensor})^2 + (2\sigma_{drive})^2 + (\sigma_{rec\text{-}misalign})^2 \right]^{1/2} \qquad (3.46)$$

(b) Plot the variation of the total error as a function of the 
tracker drive non-uniformity error for three discrete val- 
ues of dust building up (0, 1 and 2 mrad). 
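
The tabulation needed for part (b) can be sketched as follows. This assumes that Eq. 3.46 is a root-sum-square combination in which the reflector and tracker terms carry the factor of 2 mentioned in the caption of Fig. 3.40, while the solar disk and receiver misalignment terms do not; the function name and default arguments (taken from Table 3.19) are illustrative.

```python
import math


def total_spread(sigma_drive, sigma_dust, sigma_solar=9.6,
                 sigma_manuf=1.0, sigma_sensor=2.0, sigma_rec=2.0):
    """Root-sum-square total spread in mrad (assumed form of Eq. 3.46)."""
    return math.sqrt(
        sigma_solar**2
        + (2.0 * sigma_manuf)**2
        + (2.0 * sigma_dust)**2
        + (2.0 * sigma_sensor)**2
        + (2.0 * sigma_drive)**2
        + sigma_rec**2
    )


# Total spread versus drive non-uniformity for three dust buildup levels
for dust in (0.0, 1.0, 2.0):
    row = [round(total_spread(drive, dust), 2) for drive in range(0, 11, 2)]
    print(f"dust = {dust:.0f} mrad:", row)
```

Each printed row can be plotted against the drive error values 0, 2, ..., 10 mrad to give the three curves requested.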



Table 3.19 Data table for Problem 3.15 

Component    Source of error          RMS error 
                                      Fixed value   Variation over time 
Solar disk   Finite angular size      9.6 mrad      - 
Reflector    Curvature manufacture    1.0 mrad      - 
             Dust buildup             -             0-2 mrad 
Tracker      Sensor mis-alignment     2.0 mrad      - 
             Drive non-uniformity     -             0-10 mrad 
Receiver     Misalignment             2.0 mrad      - 

Making Statistical Inferences from Samples 



This chapter covers various concepts and methods dealing 
with statistical inference, namely point estimation, interval 
or confidence interval estimation, hypothesis testing and 
significance testing. These methods are used to infer point 
and interval estimates about a population from sample data 
using knowledge of probability and probability distribu- 
tions. Classical univariate and multivariate techniques as 
well as non-parametric and Bayesian methods are presen- 
ted. Further, various types of sampling methods are also 
described, which is followed by a discussion on estimators 
and their desirable properties. Finally, resampling methods 
are treated which, though computer intensive, are concep- 
tually simple, versatile, and allow robust point and interval 
estimation. 



Parameter tests on population estimates assume that the 
sample data are random and independently drawn. It is said 
that, in the case of finite populations, the sample size 
should be smaller than about 1/10th of the population size. 
Further, the data of the random variable are assumed to be 
close to being normally distributed. There is an entire field 
of inferential statistics based on nonparametric or 
distribution-free tests which can be applied to population 
data with unknown probability distributions. Though 
nonparametric tests are encumbered by fewer restrictive 
assumptions, and are easier to apply and understand, they are 
less efficient than parametric tests (in that their 
uncertainty intervals are larger). These are briefly discussed 
in Sect. 4.5, while Bayesian statistics, whereby one uses 
prior information to enhance the inference-making process, 
is addressed in Sect. 4.6. 



4.1 Introduction 

The primary reasons for resorting to sampling rather than 
measuring the whole population are to reduce expense, to make 
quick decisions (say, in the case of a production process), or, 
often, because it is impossible to do otherwise. Random sampling, the 
most common form of sampling, involves selecting samples 
from the population in a random manner which should also 
be independent. If done correctly, it reduces or eliminates 
bias while enabling inferences to be made about the popula- 
tion from the sample. Such inferences or estimates, usually 
involving descriptive measures such as the mean value or 
the standard deviation, are called estimators. These are mat- 
hematical expressions to be applied to sample data in order 
to deduce the estimate of the true parameter. For example, 
Eqs. 3.1 and 3.7 in Chap. 3 are the estimators for deducing 
the mean and standard deviation of a data set. Unfortunately, 
certain unavoidable, or even undetected, biases may creep 
into the supposedly random sample, and this could lead to 
improper or biased inferences. This issue, as well as a more 
complete discussion of sampling and sampling design is co- 
vered in Sect. 4.7. 



4.2 Basic Univariate Inferential Statistics 

4.2.1 Sampling Distribution and Confidence 
Limits of the Mean 

(a) Sampling distribution of the mean Consider a population 
from which many random samples are taken. What can 
one say about the distribution of the sample estimators? Let 
$\mu$ and $\bar{x}$ be the population mean and sample mean 
respectively, and $\sigma$ and $s_x$ be the population standard 
deviation and sample standard deviation respectively. Then, 
regardless of the shape of the population frequency 
distribution, the mean of the sampling distribution is 

$$\mu_{\bar{x}} = \mu \qquad (4.1)$$

and the standard deviation of the sample mean (also referred 
to as SE or standard error of the mean) is 

$$SE(\bar{x}) = \frac{s_x}{(n)^{1/2}} \qquad (4.2)$$

where $s_x$ is given by Eq. 3.7 and n is the sample size (the 
number of observations selected or picked). 

In case the population is small and sampling is done 
without replacement, then the above standard deviation has 
to be modified to 

$$SE(\bar{x}) = \frac{s_x}{(n)^{1/2}} \left( \frac{N - n}{N - 1} \right)^{1/2} \qquad (4.3)$$

where N is the population size. Note that if N ≫ n, one 
effectively gets back Eq. 4.2. 
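
As a small numerical illustration of Eqs. 4.2 and 4.3, the sketch below computes the standard error with and without the finite population correction. The function name and the example numbers are hypothetical.

```python
import math


def standard_error(s_x, n, population_size=None):
    """Standard error of the mean; applies Eq. 4.3 when a population size is given."""
    se = s_x / math.sqrt(n)  # Eq. 4.2
    if population_size is not None:
        # Finite population correction for sampling without replacement (Eq. 4.3)
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se


print(standard_error(2.0, 36))            # large (effectively infinite) population
print(standard_error(2.0, 36, 100))       # sample is 36% of the population
print(standard_error(2.0, 36, 100_000))   # N >> n, close to the first value
```

The correction matters only when the sample is a sizeable fraction of the population; when the whole population is sampled (n = N) the standard error drops to zero, as expected.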

The sampling distribution of the mean provides an indi- 
cation of the confidence, or the degree of certainty, one can 
place about the accuracy involved in using the sample mean 
to estimate the population mean. This confidence is interpre- 
ted as a probability, and is given by the very important law 
stated below. 

The Central Limit Theorem (one of the most important 
theorems in probability) states that if a random sample of 
n observations is selected from a population with any 
distribution, then the sampling distribution of $\bar{x}$ will be 
approximately a Gaussian distribution when n is sufficiently 
large (n > 30). The larger the sample n, the closer does the 
sampling distribution approximate the Gaussian (Fig. 4.1)¹. 
A consequence of the theorem is that it leads to a simple 
method of computing approximate probabilities of sums of 
independent random variables. It explains the remarkable 
fact that the empirical frequencies of so many natural 
"populations" exhibit bell-shaped (i.e., normal) curves. Let 
$x_1, x_2, \ldots, x_n$ be a sequence of independent identically 
distributed random variables with mean $\mu$ and variance 
$\sigma^2$. Then the distribution of the random variable z 
(Sect. 2.4.3) 

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \qquad (4.4)$$

tends to be standard normal as n tends towards infinity. Note 
that this theorem is valid for any distribution of x; herein lies 
its power. 

Probabilities for random quantities can be found by 
determining areas under the standard normal curve as described 
in Sect. 2.4.3. Suppose one takes a random sample of size n 
from a population of mean $\mu$ and standard deviation $\sigma$. 
Then the random variable z has (i) approximately the standard 
normal distribution if n > 30 regardless of the distribution of 
the population, and (ii) exactly the standard normal 
distribution if the population itself is normally distributed 
regardless of the sample size (Fig. 4.1). 
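
The behavior described by the theorem is easy to reproduce numerically. The sketch below draws repeated samples from a strongly skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen here purely for illustration) and compares the spread of the sample means with the value σ/√n; the function name, number of repetitions and seed are arbitrary choices.

```python
import random
import statistics


def sample_means(n, n_samples=5000, seed=0):
    """Means of n_samples random samples of size n from an exponential population."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]


for n in (1, 5, 30):
    means = sample_means(n)
    print(
        f"n = {n:2d}: mean of sample means = {statistics.fmean(means):.3f}, "
        f"std of sample means = {statistics.stdev(means):.3f}, "
        f"sigma/sqrt(n) = {1.0 / n ** 0.5:.3f}"
    )
```

A histogram of `means` for each n shows the skewed shape at n = 1 giving way to a nearly symmetric bell shape by n = 30, while the spread shrinks in proportion to 1/√n.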

Note that when sample sizes are small (n < 30) and the 
underlying distribution is unknown, the Student t-distribution, 
which has wider uncertainty bands (Sect. 2.4.3), should be 
used with (n − 1) degrees of freedom instead of the Gaussian 
(Fig. 2.15 and Table A4). Unlike the z-curve, there are several 
t-curves depending on the degrees of freedom (d.f.). At the 
limit of infinite d.f., the t-curve collapses into the z-curve. 

(b) Confidence limits for the mean In the sub-section above, 
the behavior of many samples, all taken from one population, 
was considered. Here, only one large random sample 
from a population is selected, and analyzed so as to make an 
educated guess on properties (or estimators) of the population 
such as its mean and standard deviation. This process 
is called inductive reasoning or arguing backwards from a 
set of observations to a reasonable hypothesis. However, the 
benefit provided by having to select only a sample of the 
population comes at a price: one has to accept some uncertainty 
in one's estimates. Based on a sample taken from a population: 

(a) one can deduce interval bounds of the population mean 
at a specified confidence level (this aspect is covered in 
this sub-section), and 

(b) one can test whether the sample mean differs from the 
presumed population mean (this is covered in the next 
sub-section). 

The concept of confidence intervals (CL) was introduced 
in Sect. 3.6.3 in reference to instrument errors. This concept, 
pertinent to random variables in general, is equally applicable 
to sampling. A 95% CL is commonly interpreted as implying 
that there is a 95% probability that the actual population 
estimate will lie within this confidence interval². The range is 
obtained from the z-curve by finding the value at which the area 
under the curve (i.e., the probability) is equal to 0.95. From 
Table A3, the corresponding critical value $z_{c/2}$ is 1.96 (note 
that the critical value for a two-tailed confidence level, as in 
this case, is determined as that value of z in Table A3 which 
corresponds to a probability value of [(1 − 0.95)/2] = 0.025). 
This implies that the probability is: 



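$$p\left( -1.96 < \frac{\bar{x} - \mu}{s_x / \sqrt{n}} < 1.96 \right) = 0.95$$

$$\text{or} \quad \bar{x} - 1.96\,\frac{s_x}{\sqrt{n}} < \mu < \bar{x} + 1.96\,\frac{s_x}{\sqrt{n}} \qquad (4.5a)$$

Thus the confidence interval of the mean is 

$$\mu = \bar{x} \pm z_{c/2} \cdot \frac{s_x}{\sqrt{n}} \qquad (4.5b)$$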



This formula is valid for any shape of the population distribu- 
tion provided, of course, that the sample is large (say, n > 30). 
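
A short sketch of Eq. 4.5b in code form is given below. It obtains the critical value from the standard normal distribution instead of Table A3; the function name and the synthetic sample (40 values generated with an arbitrary seed) are illustrative assumptions, not data from the text.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Large-sample (z-based) confidence interval of the mean, Eq. 4.5b."""
    n = len(sample)
    x_bar = fmean(sample)
    s_x = stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    # Two-tailed critical value, e.g. about 1.96 for a 95% confidence level
    z_crit = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z_crit * s_x / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Synthetic sample for illustration only
rng = random.Random(3)
sample = [rng.gauss(16.0, 2.0) for _ in range(40)]
low, high = mean_confidence_interval(sample)
print(f"95% confidence interval of the mean: {low:.2f} to {high:.2f}")
```

For small samples the same structure applies with the critical value taken from the t-distribution at n − 1 degrees of freedom instead of the z-curve.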



¹ That the sum of two Gaussian distributions from a population would 
be another Gaussian variable (a property called invariant under 
addition) is intuitive. Why the sum of two non-Gaussian distributions 
should gradually converge to a Gaussian is less so, and hence the 
importance of this theorem. 

² It will be pointed out in Sect. 4.6.2 that this statement can be debated, 
but this is a common interpretation and somewhat simpler to 
comprehend than the more accurate one. 

Fig. 4.1 Illustration of the important law of strong numbers. 
The sampling distribution of $\bar{x}$ contrasted with the parent 
population distribution for three cases. The first case (left 
column of figures) shows sampling from a normal population. As 
sample size n increases, the standard error of $\bar{x}$ decreases. 
The next two cases show that even though the populations are not 
normal, the sampling distribution still becomes approximately 
normal as n increases. (From Wonnacott and Wonnacott (1985) by 
permission of John Wiley and Sons) [Panel labels recovered from 
the figure: Population (n = 1); n = 20] 



The half-width of the 95% CL is $(1.96\, s_x / \sqrt{n})$ and is 
called the bound of the error of estimation. For small samples, 
instead of random variable z, one uses the Student t variable. 

Note that Eq. 4.5 refers to the long-run bounds, i.e., in 
the long run roughly 95% of the intervals will contain $\mu$. If 
one is interested in predicting a single x value that has yet to 
be observed, one uses the following equation (Devore and 
Farnum 2005): 



$$\text{Prediction interval of } x = \bar{x} \pm t_{c/2} \cdot s_x \left( 1 + \frac{1}{n} \right)^{1/2} \qquad (4.6)$$

where $t_{c/2}$ is the two-tailed critical value determined from 
the t-distribution at d.f. = n − 1 at the desired confidence level. 

It is clear that the prediction intervals are much wider than 
the confidence intervals because the quantity "1" within the 
brackets of Eq. 4.6 will generally dominate (1/n). This means 
that there is a lot more uncertainty in predicting the value 
of a single observation x than there is in estimating a mean 
value $\mu$. 

Example 4.2.1: Evaluating manufacturer-quoted lifetime of 
light bulbs from sample data 

A manufacturer of xenon light bulbs for street lighting claims 
that the distribution of the lifetimes of his best model has 
a mean $\mu$ = 16 years and a standard deviation $\sigma$ = 2 years 
when the bulbs are lit for 12 h every day. Suppose that a city 
official wants to check the claim by purchasing a sample of 
36 of these bulbs and subjecting them to tests that determine 
their lifetimes. 

(i) Assuming the manufacturer's claim to be true, describe 
the sampling distribution of the mean lifetime of a 
sample of 36 bulbs. Even though the shape of the 
distribution is unknown, the Central Limit Theorem 
suggests that the normal distribution can be used. Thus 
$\mu_{\bar{x}} = 16$ and $\sigma_{\bar{x}} = 2/\sqrt{36} = 0.33$ years. 

Fig. 4.2 Sampling distribution of $\bar{x}$ for a normal distribution 
N(16, 0.33). Shaded area represents the probability of the mean life of 
the bulb being < 15 years (Example 4.2.1) 

(ii) What is the probability that the sample purchased by the 
city officials has a mean-lifetime of 15 years or less? 
The normal distribution N(16, 0.33) is drawn and 
the darker shaded area to the left of x=15 as shown 
in Fig. 4.2 provides the probability of the city of- 
ficial observing a mean life of 15 years or less 
(x < 15). Next, the standard normal statistic is com- 
"-M 15-16 



puted as: z : 



-3.0 . This pro- 



cr/V« 2/V36 
bability or p- value can be read off from Table A3 as 
p(z < — 3.0)=0.0013. Consequently, the probability 
that the consumer group will observe a sample mean of 
15 or less is only 0.13%. 
(iii) If the manufacturer's claim is correct, compute the 95% prediction interval of a single bulb from the sample of 36 bulbs. From the t-tables (Table A4), the critical value is $t_c = 1.691 \approx 1.7$ for d.f. = 36 - 1 = 35 and CL = 95% corresponding to the one-tailed distribution. Thus, the 95% prediction interval of x is

$$16 \pm (1.70)(2)\left(1 + \frac{1}{36}\right)^{1/2} = 12.6 \text{ to } 19.4 \text{ years.}$$ ■
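
The three parts of Example 4.2.1 can be checked numerically. The following is a minimal sketch using only the Python standard library; the variable names are illustrative, and the critical value 1.70 is simply the tabulated value quoted in the example rather than something computed here.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 16.0, 2.0, 36      # claimed mean and std. deviation (years), sample size

# (i) Sampling distribution of the sample mean (Central Limit Theorem)
se_mean = sigma / sqrt(n)         # standard error of the mean
sampling_dist = NormalDist(mu, se_mean)

# (ii) Probability that the sample mean is 15 years or less
z = (15.0 - mu) / se_mean
p_value = sampling_dist.cdf(15.0)

# (iii) 95% prediction interval for a single bulb, using the tabulated
# critical value t_c = 1.70 quoted in the example (Table A4)
t_c = 1.70
half_width = t_c * sigma * sqrt(1 + 1 / n)
interval = (mu - half_width, mu + half_width)

print(f"standard error = {se_mean:.2f} years")
print(f"z = {z:.1f}, p = {p_value:.4f}")
print(f"prediction interval = {interval[0]:.1f} to {interval[1]:.1f} years")
```

Note that the prediction interval for a single bulb is much wider than the spread of the sample mean, because the factor $(1 + 1/n)^{1/2}$ retains the full bulb-to-bulb variability.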

The above example is one type of problem which can be 
addressed by one-sample statistical tests. However, the clas- 
sical hypothesis testing approach is slightly different, and is 
addressed next. 



4.2.2 Hypothesis Test for Single Sample Mean 

The previous sub-sections dealt with estimating confidence intervals of certain estimators of the underlying population from a single drawn sample. During hypothesis testing, on the other hand, the intent is to decide which of two competing claims is true. For example, one wishes to support the hypothesis that women live longer than men. Samples from each of the two populations are taken, and a test, called statistical inference, is performed to prove (or disprove) this claim. Since there is bound to be some uncertainty associated with such a procedure, one can only be confident of the results to a degree that can be stated as a probability. If this degree of confidence is higher than a pre-selected threshold (set by the significance level of the test), then one would conclude that women do live longer than men; otherwise, one would have to accept that the test was inconclusive.

Thus, a test of hypotheses is performed based on information deduced from the sample data involving its mean and its probability distribution, which is assumed to be close to a normal distribution. Once this is gathered, the following steps are performed:
(i) formulate the hypotheses: the null or status quo, and the alternate (which are complementary)
(ii) identify a test statistic that will be used to assess the evidence against the null hypothesis
(iii) determine the probability (or p-value) of obtaining a test statistic at least as extreme as the one observed if the null hypothesis were true
(iv) compare this value with a threshold probability corresponding to a pre-selected significance level $\alpha$ (say, 0.01 or 0.05)
(v) rule out the null hypothesis only if p-value < $\alpha$, and accept the alternate hypothesis.
This procedure can be applied to two sample tests as well, and is addressed in the subsequent sub-sections. The following example illustrates this procedure for single sample means where one would like to prove or disprove sample behavior from a previously held notion about the underlying population.

Example 4.2.2: Evaluating whether a new lamp bulb has longer burning life than traditional ones
The traditional process of light bulb manufacture results in bulbs with a mean life of $\mu = 1200$ h and a standard deviation $\sigma = 300$ h. A new process of manufacture is developed, and whether this is superior is to be determined. Such a problem involves using the classical test whereby one proceeds by defining two hypotheses:

(a) The null hypothesis, which represents the status quo, i.e., that the new process is no better than the previous one (unless the data provides convincing evidence to the contrary). In our example, the null hypothesis is $H_0: \mu = 1200$ h,

(b) The research or alternative hypothesis ($H_a$) is the premise that $\mu > 1200$ h.

Assume a sample size of n = 100 bulbs manufactured by the new process, and set the significance or error level of the test to be $\alpha = 0.05$ assuming a one-tailed test (since the new bulb manufacturing process should have a longer life, not just different from that of the traditional process). The mean life $\bar{x}$ of the sample of 100 bulbs can be assumed to be normally distributed with mean 1200 and standard deviation $\sigma/\sqrt{n} = 300/\sqrt{100} = 30$. From the standard normal table (Table A3), the critical z-value is $z_{\alpha=0.05} = 1.64$. Recalling that the critical value is defined as $z_c = \dfrac{\bar{x}_c - \mu_0}{\sigma/\sqrt{n}}$ leads to

$$\bar{x}_c = 1200 + 1.64 \times 300/(100)^{1/2} = 1249 \text{ or about } 1250.$$

Suppose testing of the 100 bulbs yields a value of $\bar{x} = 1260$. As $\bar{x} > \bar{x}_c$, one would reject the null hypothesis at the 0.05 significance (or error) level. This is akin to jury trials where the null hypothesis is taken to be that the accused is innocent, and the burden of proof during hypothesis testing is on the alternate hypothesis, i.e., on the prosecutor to show overwhelming evidence of the culpability of the accused. If such overwhelming evidence is absent, the null hypothesis is preferentially favored. ■
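
The test steps applied in Example 4.2.2 can be sketched in a few lines of Python. This is only an illustration under the assumptions of the example (known $\sigma$, large sample, upper-tailed test); the function name `one_sample_z_test` and its return keys are hypothetical, and the critical value is computed from the standard normal distribution rather than read from Table A3, so it comes out as 1.645 instead of the rounded 1.64.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(x_bar, mu0, sigma, n, alpha=0.05):
    """Upper-tailed z-test of H0: mu = mu0 against Ha: mu > mu0."""
    se = sigma / sqrt(n)                      # standard error of the mean
    z = (x_bar - mu0) / se                    # test statistic
    z_c = NormalDist().inv_cdf(1 - alpha)     # critical z-value
    x_c = mu0 + z_c * se                      # critical value of the sample mean
    p_value = 1 - NormalDist().cdf(z)         # upper-tail probability under H0
    return {"z": z, "z_c": z_c, "x_c": x_c,
            "p_value": p_value, "reject_H0": p_value < alpha}

# Data of Example 4.2.2: traditional process mu0 = 1200 h, sigma = 300 h,
# sample of n = 100 new bulbs with observed mean 1260 h
result = one_sample_z_test(x_bar=1260, mu0=1200, sigma=300, n=100)
for key, value in result.items():
    print(key, value)
```

Comparing the observed mean with the critical mean and comparing the p-value with $\alpha$ are two equivalent ways of reaching the same decision.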

There is another way of looking at this testing procedure (Devore and Farnum 2005):

(a) $H_0$ is true, but one has been exceedingly unlucky and got a very improbable sample with mean $\bar{x}$. In other words, the observed difference turned out to be significant when, in fact, there is no real difference. Thus, the null hypothesis has been rejected erroneously. The innocent man has been falsely convicted;
(b) $H_0$ is not true after all. Thus, it is no surprise that the observed $\bar{x}$ value was so high, or that the accused is indeed culpable.
The second explanation is likely to be more plausible, but there is always some doubt because statistical decisions inherently contain probabilistic elements. In other words, statistical tests of hypothesis do not always yield conclusions with absolute certainty: they have in-built margins of error, just like jury trials are known to hand down wrong verdicts. Hence, two types of errors can be distinguished:
(i) Concluding that the null hypothesis is false, when in fact it is true, is called a Type I error, and represents the probability $\alpha$ (i.e., the pre-selected significance level) of erroneously rejecting the null hypothesis. This is also called the "false positive" or "false alarm" rate. The upper normal distribution shown in Fig. 4.3 has a mean value of 1200 (equal to the population or claimed mean value) with a standard deviation of 30. The area to the right of the critical value of 1250 represents the probability of a Type I error occurring.



(ii) The flip side, i.e., concluding that the null hypothesis is true when in fact it is false, is called a Type II error and represents the probability $\beta$ of erroneously accepting the null hypothesis, also called the "false negative" rate. The lower plot of the normal distribution shown in Fig. 4.3 now has a mean of 1260 (the mean value of the sample) with a standard deviation of 30, while the area to the left of the critical value $\bar{x}_c$ indicates the probability $\beta$ of being in error of Type II.

Fig. 4.3 The two kinds of error that occur in a classical test. (a) If $H_0$ is true, then significance level $\alpha$ = probability of erring (rejecting the true hypothesis $H_0$). (b) If $H_a$ is true, then $\beta$ = probability of erring (judging that the false hypothesis $H_0$ is acceptable). The numerical values correspond to data from Example 4.2.2. [Figure: two normal curves, the upper centered at 1200 and the lower at 1260; a vertical line at the critical value 1250 separates the "Accept H0" region on the left from the "Reject H0" region on the right.]
The two types of error are inversely related, as is clear from the vertical line in Fig. 4.3 drawn through both figures. A decrease in the probability of one type of error is likely to result in an increase in the probability of the other. Unfortunately, one cannot simultaneously reduce both by selecting a smaller value of $\alpha$. The analyst would select the significance level depending on the tolerance, or seriousness of the consequences, of either type of error specific to the circumstance. Recall that the probability of making a Type I error is called the significance level of the test. The probability of correctly rejecting a false null hypothesis, $(1 - \beta)$, is referred to as the statistical power. The only way of reducing both types of errors is to increase the sample size, with the expectation that the standard deviation of the sample mean would decrease and the sample mean would get closer to the population mean.
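
The trade-off between the two error types can be made concrete with the numbers of Example 4.2.2. The sketch below assumes, as in Fig. 4.3, that the true mean under the alternate hypothesis equals the observed sample mean of 1260 h; the function name `type_ii_error` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu0, mu_alt, sigma, n, alpha=0.05):
    """Probability beta of failing to reject H0 (upper-tailed z-test)
    when the true mean is mu_alt."""
    se = sigma / sqrt(n)
    x_c = mu0 + NormalDist().inv_cdf(1 - alpha) * se   # critical sample mean
    # Area of the alternate sampling distribution to the left of x_c
    return NormalDist(mu_alt, se).cdf(x_c)

beta_100 = type_ii_error(1200, 1260, 300, n=100)               # baseline case
beta_strict = type_ii_error(1200, 1260, 300, n=100, alpha=0.01)  # smaller alpha
beta_400 = type_ii_error(1200, 1260, 300, n=400)               # larger sample

print(f"n=100, alpha=0.05: beta = {beta_100:.3f}, power = {1 - beta_100:.3f}")
print(f"n=100, alpha=0.01: beta = {beta_strict:.3f}")
print(f"n=400, alpha=0.05: beta = {beta_400:.3f}")
```

Tightening $\alpha$ at fixed sample size raises $\beta$, whereas a larger sample lowers $\beta$ at the same $\alpha$, which is the behavior described in the text.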

An important concept needs to be clarified, namely when one uses one-tailed as against two-tailed tests. In the two-tailed test, one is testing whether the sample is different (i.e., smaller or larger) than the stipulated population. In cases where one wishes to test whether the sample is specifically larger (or specifically smaller) than the stipulated population, the one-tailed test is used (as in Examples 4.2.1 and 4.2.2). The tests are set up and addressed in like manner, the difference being in how the p-level is finally determined. The shaded areas of the normal distributions shown in Fig. 4.4 illustrate the difference between both types of tests for a significance level of 0.05: the one-tailed test places the entire probability p = 0.05 in a single tail, whereas the two-tailed test places half of it (p = 0.025) in each tail.

One final issue relates to the selection of the test sta- 
tistic. One needs to distinguish between the following two 
instances: 



(i) if the population variance $\sigma^2$ is known and for sample sizes n > 30, then the z statistic is selected for performing the test along with the standard normal tables (as done for Example 4.2.2 above);

(ii) if the population variance is unknown or if the sample size n < 30, then the t-statistic is selected (using the sample standard deviation s instead of $\sigma$) for performing the test using Student-t tables with the appropriate degrees of freedom.



4.2.3 Two Independent Sample and Paired 
Difference Tests on Means 

As opposed to hypothesis tests for a single population mean, 
there are hypothesis tests that allow one to compare values of 
two population means from samples taken from each popula- 
tion. Two basic presumptions for the tests (described below) 
to be valid are that the standard deviations of the populations 
are reasonably close, and that the populations are approxi- 
mately normally distributed. 

(a) Two independent sample test. The test is based on the information (namely, the mean and the standard deviation) obtained from taking two independent random samples from the two populations under consideration whose variances are unknown and unequal (but reasonably close). Using the same notation as before for population and sample, and using subscripts 1 and 2 to denote the two samples, the random variable

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{1/2}} \tag{4.7}$$

is said to approximate the standard normal distribution for large samples ($n_1 > 30$ and $n_2 > 30$), where $s_1$ and $s_2$ are the standard deviations of the two samples. The denominator is called the standard error (SE) and is a measure of the total variability of both samples combined (remember that the standard deviations of independent quantities add in quadrature, i.e., their variances add).



Fig. 4.4 Illustration of critical cutoff values between one-tailed and two-tailed tests assuming the normal distribution. The shaded areas represent the probability values corresponding to 95% CL or 0.05 significance level or p = 0.05. The critical values shown (1.645 in magnitude for the one-tailed test and ±1.96 for the two-tailed test) can be determined from Table A3

The confidence intervals of the difference in the population means can be determined as:

$$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm z_c \cdot SE(\bar{x}_1, \bar{x}_2) \quad \text{where} \quad SE(\bar{x}_1, \bar{x}_2) = \left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^{1/2} \tag{4.8}$$

where $z_c$ is the critical value at the selected significance level. Thus, the testing of the two samples involves a single random variable combining the properties of both.

For smaller sample sizes, Eq. 4.8 still applies, but the z standardized variable is replaced with the Student-t variable. The critical values are found from the Student t-tables with degrees of freedom d.f. = $n_1 + n_2 - 2$. If the variances of the population are known, then these should be used instead of the sample variances.

Some textbooks suggest the use of "pooled variances" when the samples are small and the variances of both populations are close. Here, instead of using individual standard deviation values $s_1$ and $s_2$, a new quantity called the pooled variance $s_p^2$ is used:

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \quad \text{with d.f.} = n_1 + n_2 - 2 \tag{4.9}$$

Note that the pooled variance is simply the weighted average of the two sample variances. The use of the pooled variance approach is said to result in tighter confidence intervals, and hence its appeal. The random variable approximates the t-distribution, and the confidence intervals of the difference in the population means are:

$$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm t_c \cdot SE(\bar{x}_1, \bar{x}_2) \quad \text{where} \quad SE(\bar{x}_1, \bar{x}_2) = \left[s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right]^{1/2} \tag{4.10}$$



Devore and Farnum (2005) strongly discourage the use of 
the pooled variance approach as a general rule, and so the 
better approach, when in doubt, is to use Eq. 4.8 so as to be 
conservative. 

Figure 4.5 illustrates, in a simple conceptual manner, the four characteristic cases which can arise when comparing the means of two populations based on sampled data. Recall that the box and whisker plot is a type of graphical display of the shape of the distribution where the solid line denotes the median, the upper and lower hinges of the box indicate the interquartile range values (25th and 75th percentiles), with the whiskers extending to 1.5 times this range. Case (a) corresponds to the case where the two whisker bands do not overlap, and one could state with confidence that the two population means are very likely to be different at the 95% confidence level. Case (b) also suggests a difference between population means, but with a little less certitude. Case (d) illustrates the case where the two whisker bands are practically identical, and so the population means are very likely to be statistically similar. It is when cases as illustrated in frames (b) and (c) occur that the value of statistical tests becomes apparent. As a rough rule of thumb, if the 25th percentile for one sample exceeds the median line of the other sample, one could conclude that the means are likely to be different (Walpole et al. 2007).

Manly (2005) states that the independent random sample 
test is fairly robust to the assumptions of normality and equal 
population variance especially when the sample size exceeds 
20 or so. The assumption of equal population variances is 
said not to be an issue if the ratio of the two variances is 
within 0.4 to 2.5. 

Fig. 4.5 Conceptual illustration of four characteristic cases that may arise during two-sample testing of medians. The box and whisker plots provide some indication as to the variability in the results of the tests. Case (a) clearly indicates that the samples are very much different, while the opposite applies to case (d). However, it is more difficult to draw conclusions from cases (b) and (c), and it is in such cases that statistical tests are useful

Example 4.2.3: Verifying savings from energy conservation measures in homes

Certain electric utilities with limited generation capacities fund contractors to weather strip residences in an effort to reduce infiltration losses and thereby lower electricity needs^. Suppose an electric utility wishes to determine the cost-effectiveness of their weather-stripping program by comparing the annual electric energy use of 200 similar residences in a given community, half of which were weather-stripped, and the other half were not. Samples collected from both types of residences yield:

Control sample: $\bar{x}_1$ = 18,750; $s_1$ = 3,200 and $n_1$ = 100.
Weather-stripped sample: $\bar{x}_2$ = 15,150; $s_2$ = 2,700 and $n_2$ = 100.

The mean difference $(\bar{x}_1 - \bar{x}_2)$ = 18,750 - 15,150 = 3,600, i.e., the mean saving in each weather-stripped residence is 19.2% (= 3,600/18,750) of the mean baseline or control home. However, there is an uncertainty associated with this mean value since only a sample has been analyzed. This uncertainty is characterized as a bounded range for the mean difference. At the 95% CL, corresponding to a significance level $\alpha = 0.05$ for a one-tailed distribution, $z_c = 1.645$ from Table A3, and from Eq. 4.8:

$$\mu_1 - \mu_2 = (18{,}750 - 15{,}150) \pm 1.645\left(\frac{\sigma_1^2}{100} + \frac{\sigma_2^2}{100}\right)^{1/2}$$

To complete the calculation of the confidence interval, it is assumed, given that the sample sizes are large, that the sample variances are reasonably close to the population variances. Thus, our confidence interval is approximately:

$$3{,}600 \pm 1.645\left(\frac{3{,}200^2}{100} + \frac{2{,}700^2}{100}\right)^{1/2} = 3{,}600 \pm 689 = (2{,}911 \text{ and } 4{,}289)$$

These intervals represent the lower and upper values of saved energy at the 95% CL. To conclude, one can state that the savings are positive, i.e., one can be 95% confident that there is an energy benefit in weather-stripping the homes. More specifically, the mean saving is 19.2% of the baseline value with an uncertainty of 19.1% (= 689/3,600) in the savings at the 95% CL. Thus, the relative uncertainty in the savings estimate is substantial, which tempers confidence in the estimated efficacy of the conservation program. Increasing the sample size or resorting to stratified sampling are obvious options and are discussed in Sect. 4.7. Another option is to adopt a less stringent confidence level; 90% CL is commonly adopted. This example reflects a realistic concern in that energy savings in homes from energy conservation measures are often difficult to verify accurately. ■
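
The interval of Example 4.2.3 follows directly from Eq. 4.8. The sketch below reproduces the arithmetic; the function name `diff_of_means_interval` is hypothetical, and the critical value 1.645 is the one quoted in the example.

```python
from math import sqrt

def diff_of_means_interval(x1, s1, n1, x2, s2, n2, z_c):
    """Difference of two sample means and the half-width of its
    confidence interval (Eq. 4.8, unpooled standard error)."""
    diff = x1 - x2
    se = sqrt(s1**2 / n1 + s2**2 / n2)   # standard error of the difference
    return diff, z_c * se

# Control homes (sample 1) versus weather-stripped homes (sample 2)
diff, half_width = diff_of_means_interval(18750, 3200, 100,
                                          15150, 2700, 100, z_c=1.645)
lower, upper = diff - half_width, diff + half_width

print(f"mean saving = {diff} +/- {half_width:.0f}")
print(f"interval = {lower:.0f} to {upper:.0f}")
print(f"saving as fraction of baseline = {diff / 18750:.3f}")
print(f"uncertainty as fraction of saving = {half_width / diff:.3f}")
```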

(b) Paired difference test. The previous section dealt with independent samples from two populations with close to normal probability distributions. There are instances when the samples are somewhat correlated, and such interdependent samples are called paired samples. This interdependence can also arise when the samples are taken at the same time, and are affected by a time-varying variable which is not explicitly considered in the analysis. Rather than the individual values, the difference is taken as the only random sample since it is likely to exhibit much less variation than those of the two samples. Thus, the confidence intervals calculated from paired data will be narrower than those calculated from two independent samples. Let $d_i$ be the difference between individual readings of two small paired samples (n < 30), and $\bar{d}$ their mean value. Then, the t-statistic is taken to be:

$$t = \bar{d}/SE \quad \text{where} \quad SE = s_d/\sqrt{n} \tag{4.11a}$$



^ This is considered more cost effective to utilities in terms of deferred 
capacity expansion costs than the resulting revenue loss in electricity 
sales due to such conservation measures. 



and the confidence interval around $\bar{d}$ is:

$$\mu_d = \bar{d} \pm t_c\,(s_d/\sqrt{n}) \tag{4.11b}$$

Hypothesis testing of means for paired samples is done the 
same way as that for a single independent mean, and is usu- 
ally (but not always) superior to an independent sample test. 
Paired difference tests are used for comparing "before and 
after" or "with and without" type of experiments done on 
the same group in turn, say, to assess effect of an action per- 
formed. For example, the effect of an additive in gasoline 
meant to improve gas mileage can be evaluated statistically 
by considering a set of data representing the difference in the 
gas mileage of n cars which have each been subjected to tests 
involving "no additive" and "with additive". Its usefulness is 
illustrated by the following example which is another type of 
application for which paired difference tests can be used. 

Example 4.2.4: Comparing energy use of two similar buildings based on utility bills: the wrong way
Buildings which are designed according to certain performance standards are eligible for recognition as energy-efficient buildings by federal and certification agencies. A recently completed building (B2) was awarded such an honor. The federal inspector, however, denied the request of another owner of an identical building (B1) close by who claimed that the differences in energy use between both buildings were within statistical error. An energy consultant was hired by the owner to prove that B1 is as energy efficient as B2. He chose to compare the monthly mean utility bills over a year between the two commercial buildings based on the data recorded over the same 12 months and listed in Table 4.1. This problem can be addressed using the two sample test method described earlier.
Table 4.1 Monthly utility bills and the corresponding outdoor temperature for the two buildings being compared (Example 4.2.4)

Month            Building B1        Building B2        Difference in costs   Outdoor
                 utility cost ($)   utility cost ($)   (B1 - B2) ($)         temperature (°C)
1                693                639                54                    3.5
2                759                678                81                    4.7
3                1005               918                87                    9.2
4                1074               999                75                    10.4
5                1449               1302               147                   17.3
6                1932               1827               105                   26
7                2106               2049               57                    29.2
8                2073               1971               102                   28.6
9                1905               1782               123                   25.5
10               1338               1281               57                    15.2
11               981                933                48                    8.7
12               873                825                48                    6.8
Mean             1,349              1,267              82
Std. deviation   530.07             516.03             32.00

The null hypothesis is that the mean monthly utility charges $\mu_1$ and $\mu_2$ for the two buildings are equal, against the alternative hypothesis that they differ. Since the sample sizes are less than 30, the t-statistic has to be used instead of the standard normal z statistic. The pooled variance approach given by Eq. 4.9 is appropriate in this instance. It is computed as:

$$s_p^2 = \frac{(12 - 1)(530.07)^2 + (12 - 1)(516.03)^2}{12 + 12 - 2} = 273{,}630.6$$

while the t-statistic can be deduced from Eq. 4.10 and is given by

$$t = \frac{(1349 - 1267) - 0}{\left[(273{,}630.6)\left(\dfrac{1}{12} + \dfrac{1}{12}\right)\right]^{1/2}} = \frac{82}{213.54} = 0.38 \quad \text{for d.f.} = 12 + 12 - 2 = 22$$


The t-value is very small, and will not lead to the rejection of the null hypothesis even at significance level $\alpha = 0.10$ (from Table A4, the one-tailed critical value is 1.321 for CL = 90% and d.f. = 22). Thus, the consultant would report that insufficient statistical evidence exists to state that the two buildings are different in their energy consumption.

Example 4.2.5: Comparing energy use of two similar buildings based on utility bills: the right way
There is, however, a problem with the way the energy consultant performed the test. Close observation of the data as plotted in Fig. 4.6 would lead one not only to suspect that this conclusion is erroneous, but also to observe that the utility bills of the two buildings tend to rise and fall together because of seasonal variations in the outdoor temperature. Hence the condition that the two samples are independent is violated. It is in such circumstances that a paired test is relevant. Here, the test is meant to determine whether the monthly mean of the differences in utility charges between both buildings ($\mu_D$) is zero or not. The null hypothesis is that this is zero, while the alternate hypothesis is that it is different from zero. Thus:

$$t\text{-statistic} = \frac{\bar{x}_D - 0}{s_D/\sqrt{n_D}} = \frac{82}{32/\sqrt{12}} = 8.88 \quad \text{with d.f.} = 12 - 1 = 11$$

where the values of 82 and 32 are found from Table 4.1.

For a significance level of 0.05 and using a one-tailed test, Table A4 suggests a critical value $t_{0.05} = 1.796$. Because 8.88 is much higher than this critical value, one can safely reject the null hypothesis. In fact, Bldg 1 is less energy efficient than Bldg 2 even at a significance level of 0.0005 (or CL = 99.95%), and the owner of B1 does not have a valid case at all! This illustrates how misleading results can be obtained



Fig. 4.6 Illustrating variation of 
the utility bills for the two build- 
ings Bl and B2 (Example 4.2.5) 
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4 Making Statistical Inferences from Samples 



if inferential tests are misused, or if the analyst ignores the 
underlying assumptions behind a particular test. 
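
The contrast between Examples 4.2.4 and 4.2.5 can be reproduced from the monthly bills of Table 4.1. The sketch below computes both t-statistics with the Python standard library; the variable names are illustrative, and the critical value 1.796 is the tabulated one quoted in Example 4.2.5.

```python
from math import sqrt
from statistics import mean, stdev

# Monthly utility bills ($) from Table 4.1
b1 = [693, 759, 1005, 1074, 1449, 1932, 2106, 2073, 1905, 1338, 981, 873]
b2 = [639, 678, 918, 999, 1302, 1827, 2049, 1971, 1782, 1281, 933, 825]
n = len(b1)

# Independent two-sample test with pooled variance (Eqs. 4.9 and 4.10)
sp2 = ((n - 1) * stdev(b1) ** 2 + (n - 1) * stdev(b2) ** 2) / (n + n - 2)
t_pooled = (mean(b1) - mean(b2)) / sqrt(sp2 * (1 / n + 1 / n))

# Paired difference test (Eq. 4.11a): work with the monthly differences only
d = [x - y for x, y in zip(b1, b2)]
t_paired = mean(d) / (stdev(d) / sqrt(n))

t_crit = 1.796  # one-tailed, alpha = 0.05, d.f. = 11 (Table A4)
print(f"pooled t = {t_pooled:.2f} (d.f. = {2 * n - 2})")
print(f"paired t = {t_paired:.2f} (d.f. = {n - 1}), reject H0: {t_paired > t_crit}")
```

The seasonal swing dominates the month-to-month scatter of each building, but it cancels in the differences, which is why the paired statistic is so much larger.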



4.2.4 Single and Two Sample Tests for 
Proportions 



There are several cases where surveys are performed in order to determine fractions or proportions of populations who either have preferences of some sort or have a certain type of equipment. For example, the gas company may wish to determine what fraction of their customer base has gas heating as against oil heat or electric heat pumps. The company performs a survey on a random sample from which it would like to extrapolate and ascertain confidence limits on this fraction. It is in such cases, where each outcome can be interpreted as either a "success" (the customer has gas heat) or a "failure", in short a binomial experiment (see Sect. 2.4.2b), that the following test is useful.

(a) Single sample test. Let p be the population proportion one wishes to estimate from the sample proportion $\hat{p}$, which can be determined as:

$$\hat{p} = \frac{\text{number of successes in sample}}{\text{total number of trials}} = \frac{x}{n}$$

Then, provided the sample is large (n > 30), proportion $\hat{p}$ is an unbiased estimator of p with approximately normal distribution. Normalizing the expression for the Bernoulli trials (Eq. 2.33b) by the number of trials n yields the standard deviation of the sampling distribution of $\hat{p}$:

$$\left[\frac{p(1 - p)}{n}\right]^{1/2} \tag{4.12}$$

Thus, the large sample confidence interval for p for the two-tailed case at a significance level $\alpha$ is given by:

$$\hat{p} \pm z_{\alpha/2}\left[\frac{\hat{p}(1 - \hat{p})}{n}\right]^{1/2} \tag{4.13}$$

Example 4.2.6: In a random sample of n = 1000 new residences in Scottsdale, AZ, it was found that 630 had swimming pools. Find the 95% confidence interval for the fraction of buildings which have pools.

In this case, n = 1000, while $\hat{p} = 630/1000 = 0.63$. From Table A3, the one-tailed critical value $z_{0.025} = 1.96$, and hence from Eq. 4.13, the two-tailed 95% confidence interval for p is:

$$0.63 - 1.96\left[\frac{0.63(1 - 0.63)}{1000}\right]^{1/2} < p < 0.63 + 1.96\left[\frac{0.63(1 - 0.63)}{1000}\right]^{1/2}$$

or 0.6001 < p < 0.6599.

Example 4.2.7: The same equations can also be used to determine sample size in order for p not to exceed a certain range or error e. For instance, one would like to determine from Example 4.2.6 data the sample size which will yield an estimate of p within 0.02 or less at 95% CL.

Then, recasting Eq. 4.13 results in a sample size:

$$n = \frac{z_{\alpha/2}^2\,\hat{p}(1 - \hat{p})}{e^2} = \frac{(1.96^2)(0.63)(1 - 0.63)}{(0.02)^2} = 2239$$

It must be pointed out that the above example is somewhat misleading since one does not know the value of p beforehand. One may have a preliminary idea, in which case the sample size n would be an approximate estimate, and this may have to be revised once some data is collected.

(b) Two sample tests. The intent here is to estimate whether statistically significant differences exist between proportions of two populations based on one sample drawn from each population. Assume that the two samples are large and independent. Let $\hat{p}_1$ and $\hat{p}_2$ be the sampling proportions. Then, the sampling distribution of $(\hat{p}_1 - \hat{p}_2)$ is approximately normal with $(\hat{p}_1 - \hat{p}_2)$ being an unbiased estimator of $(p_1 - p_2)$ and the standard deviation given by:

$$\left[\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}\right]^{1/2} \tag{4.14}$$




The following example illustrates the procedure. 

Example 4.2.8: Hypothesis testing of increased incidence 
of lung ailments due to radon in homes 
The Environmental Protection Agency (EPA) would like to 
determine whether the fraction of residents with health pro- 
blems living in an area known to have high radon concentra- 
tions is statistically different from one where levels of radon 
are negligible. Specifically, it wishes to test the hypothesis at 
the 95% CL that the fraction of residents with lung ailments 
in radon prone areas is higher than one with low radon levels. 
The following data is collected: 

High radon level area: $n_1 = 100$, $\hat{p}_1 = 0.38$
Low radon area: $n_2 = 225$, $\hat{p}_2 = 0.22$

Then

null hypothesis $H_0: (p_1 - p_2) = 0$
alternative hypothesis $H_1: (p_1 - p_2) > 0$

One calculates the random variable

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\left[\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}\right]^{1/2}} = \frac{0.38 - 0.22}{\left[\dfrac{(0.38)(0.62)}{100} + \dfrac{(0.22)(0.78)}{225}\right]^{1/2}} = 2.865$$



A one-tailed test is appropriate, and from Table A3 the critical value is $z_{0.05} = 1.65$ for the 95% CL. Since the calculated z value > $z_c$, this would suggest that the null hypothesis can be rejected. Thus, one would conclude that those living in areas of high radon levels have a statistically higher incidence of lung ailments than those who do not. Further inspection of Table A3 reveals that z = 2.865 corresponds to a probability value of 0.0021, or close to the 99.8% CL. Should the EPA require mandatory testing of all homes at some expense to all homeowners, or should some other policy measure be adopted? These types of considerations fall under the purview of decision making discussed in Chap. 12. ■
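
The proportion calculations of this sub-section (Eqs. 4.13 and 4.14) can be sketched as follows. The function names are hypothetical, the critical values are computed from the standard normal distribution instead of being read from Table A3, and the confidence interval uses the sample size n = 1000 stated in Example 4.2.6.

```python
from math import sqrt, ceil
from statistics import NormalDist

def proportion_ci(p_hat, n, cl=0.95):
    """Two-tailed large-sample confidence interval for a proportion (Eq. 4.13)."""
    z = NormalDist().inv_cdf(1 - (1 - cl) / 2)
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def sample_size_for_error(p_hat, e, cl=0.95):
    """Sample size needed to estimate p within +/- e (Eq. 4.13 recast for n)."""
    z = NormalDist().inv_cdf(1 - (1 - cl) / 2)
    return ceil(z**2 * p_hat * (1 - p_hat) / e**2)

def two_proportion_z(p1, n1, p2, n2):
    """z statistic for H0: p1 - p2 = 0, using the standard deviation of Eq. 4.14."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se

ci_low, ci_high = proportion_ci(0.63, 1000)        # Example 4.2.6
n_required = sample_size_for_error(0.63, 0.02)     # Example 4.2.7
z_radon = two_proportion_z(0.38, 100, 0.22, 225)   # Example 4.2.8
p_radon = 1 - NormalDist().cdf(z_radon)            # one-tailed p-value

print(f"95% CI for p: {ci_low:.4f} to {ci_high:.4f}")
print(f"required sample size: {n_required}")
print(f"z = {z_radon:.3f}, one-tailed p = {p_radon:.4f}")
```

As Example 4.2.7 cautions, the sample size estimate depends on a preliminary value of the proportion, so it should be treated as approximate.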



4.2.5 Single and Two Sample Tests of Variance 

Recall that when a sample mean is used to provide an estimate of the population mean $\mu$, it is more informative to give a confidence interval for $\mu$ instead of simply stating the value $\bar{x}$. A similar approach can be adopted for estimating the population variance from that of a sample.

(a) Single sample test. The confidence intervals for a population variance $\sigma^2$ based on sample variance $s^2$ are to be determined. To construct such confidence intervals, one will use the fact that if a random sample of size n is taken from a population that is normally distributed with variance $\sigma^2$, then the random variable

$$\chi^2 = \frac{(n - 1)}{\sigma^2}\,s^2 \tag{4.15}$$

has the chi-square distribution with $\nu = (n - 1)$ degrees of freedom (described in Sect. 2.4.3). The advantage of using $\chi^2$ instead of $s^2$ is similar to the advantage of standardizing a variable to a normal random variable. Such a transformation allows standard tables (such as Table A5) to be used for determining probabilities irrespective of the magnitude of $s^2$. The basis of these probability tables is again akin to finding the areas under the chi-square curves.

Example 4.2.9: A company which makes boxes wishes to determine whether their automated production line requires major servicing or not. They will base their decision on whether the weight from one box to another is significantly different from a maximum permissible population variance value of $\sigma^2 = 0.12$ kg². A sample of 10 boxes is selected, and their variance is found to be $s^2 = 0.24$ kg². Is this difference significant at the 95% CL?

From Eq. 4.15, the observed chi-square value is

$$\chi^2 = \frac{(10 - 1)}{0.12}(0.24) = 18$$

Inspection of Table A5 for $\nu = 9$ degrees of freedom reveals that for a significance level $\alpha = 0.05$ the critical chi-square value is $\chi_c^2 = 16.92$ and, for $\alpha = 0.025$, $\chi_c^2 = 19.02$. Thus, the result is significant at $\alpha = 0.05$ or 95% CL. However, the result is not significant at the 97.5% CL. Whether to service the automated production line based on these statistical tests involves performing a decision analysis. ■
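
The comparison made in Example 4.2.9 amounts to evaluating Eq. 4.15 and checking the result against tabulated critical values. A minimal sketch is given below; the function name is hypothetical, and the critical values are the ones quoted from Table A5 in the example rather than computed.

```python
def chi_square_statistic(s2, sigma2, n):
    """Observed chi-square value for a sample variance s2 (Eq. 4.15)."""
    return (n - 1) * s2 / sigma2

chi2_obs = chi_square_statistic(s2=0.24, sigma2=0.12, n=10)

# Critical values for 9 degrees of freedom, as quoted from Table A5
critical_values = {0.05: 16.92, 0.025: 19.02}
significant = {alpha: chi2_obs > crit for alpha, crit in critical_values.items()}

print(f"observed chi-square = {chi2_obs:.1f}")
for alpha, is_significant in significant.items():
    print(f"alpha = {alpha}: significant = {is_significant}")
```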



(b) Two sample tests. This instance applies to the case when two independent random samples are taken from two populations that are normally distributed, and one needs to determine whether the variances of the two populations are different or not. Such tests find applications prior to conducting t-tests on two means, which presume equal variances. Let $\sigma_1$ and $\sigma_2$ be the standard deviations of both the populations, and $s_1$ and $s_2$ be the sample standard deviations. If $\sigma_1 = \sigma_2$, then the random variable

$$F = \frac{s_1^2}{s_2^2} \tag{4.16}$$

has the F-distribution (described in Sect. 2.4.3) with degrees of freedom (d.f.) = $(\nu_1, \nu_2)$ where $\nu_1 = (n_1 - 1)$ and $\nu_2 = (n_2 - 1)$.

Note that the distributions are different for different combinations of $\nu_1$ and $\nu_2$. The probabilities for F can be determined using areas under the F curves or from tabulated values (Table A6). Note that the F-test applies to independent samples, and, unfortunately, is known to be rather sensitive to the assumption of normality. Hence, some argue against its use altogether for two sample testing (Manly 2005).

Example 4.2.10: Comparing variability in daily productivity of two
workers

It is generally acknowledged that worker productivity increases if
the work environment is conditioned so as to meet the stipulated
human comfort conditions. One is interested in comparing the mean
productivity of two office workers. However, before undertaking that
evaluation, one is unsure about the assumption of equal variances in
productivity of the workers (i.e., in how consistent the workers are
from one day to another). The F-test can be used to check the validity
of this assumption. Suppose the following data has been collected for
two workers under the same environment and performing similar tasks.
An initial analysis of the data suggests that the normality condition
is met for both workers:

Worker A: $n_1 = 13$ days, mean $\bar{x}_1 = 26.3$ production units,
standard deviation $s_1 = 8.2$ production units.

Worker B: $n_2 = 18$ days, mean $\bar{x}_2 = 19.7$ production units,
standard deviation $s_2 = 6.0$ production units.

The intent here is to compare not the means but the standard
deviations. The F-statistic is determined by always choosing the
larger variance as the numerator. Then $F = (8.2/6.0)^2 = 1.87$. From
Table A6, the critical F value is $F_c = 2.38$ for (13-1) = 12 and
(18-1) = 17 degrees of freedom at a significance level
$\alpha = 0.05$. Thus, as illustrated in Fig. 4.7, one cannot reject
the null hypothesis, and concludes that the data do not provide enough
evidence to indicate that the population variances of the two workers
are statistically different at $\alpha = 0.05$. Hence, one can now
proceed to use the two-sample t-test with some confidence to determine
whether the difference in the means between both workers is
statistically significant or not.

Fig. 4.7 [Plot of the F density, panel title "F distribution with
d.f. (17,12)", marking the critical value of 2.38 for
$\alpha = 0.05$, the rejection region and the calculated F value.]
Since the calculated F value is lower than the critical value, one
cannot reject the null hypothesis (Example 4.2.10)
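The variance-ratio check of Example 4.2.10 can be sketched as follows. The helper name is hypothetical, and the critical value is the Table A6 figure quoted in the text, entered by hand rather than computed.

```python
def variance_ratio(s_a, n_a, s_b, n_b):
    """Return (F, numerator d.f., denominator d.f.) with the larger variance on top."""
    if s_a >= s_b:
        return (s_a / s_b) ** 2, n_a - 1, n_b - 1
    return (s_b / s_a) ** 2, n_b - 1, n_a - 1


# Worker A: 13 days, s = 8.2; Worker B: 18 days, s = 6.0
F, df_num, df_den = variance_ratio(8.2, 13, 6.0, 18)

F_CRITICAL = 2.38  # Table A6 value for (12, 17) d.f. at alpha = 0.05, as quoted in the text

print(f"F = {F:.2f} with d.f. = ({df_num}, {df_den})")
print("variances differ" if F > F_CRITICAL else "no evidence that the variances differ")
```

Returning the degrees of freedom together with F makes it harder to look up the critical value with the numerator and denominator swapped.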



4.2.6 Tests for Distributions 

The chi-square ($\chi^2$) statistic applies to discrete data. It is
used to statistically test the hypothesis that a set of empirical or
sample data does not differ significantly from that which would be
expected from some specified theoretical distribution. In other words,
it is a goodness-of-fit test to ascertain whether the distribution of
proportions of one group differs from another or not. The chi-square
statistic is computed as:

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_{obs} - f_{exp})^2}{f_{exp}} \tag{4.17}$$

where $f_{obs}$ is the observed frequency of each class or interval,
$f_{exp}$ is the expected frequency for each class predicted by the
theoretical distribution, and k is the number of classes or intervals.
If $\chi^2 = 0$, then the observed and theoretical frequencies agree
exactly. If not, the larger the value of $\chi^2$, the greater the
discrepancy. Tabulated values of $\chi^2$ are used to determine
significance for different values of degrees of freedom
$\nu = k - 1$ (see Table A5). Certain restrictions apply for proper
use of this test. The sample size should be greater than 30, and none
of the expected frequencies should be less than 5 (Walpole et al.
2007). In other words, a long tail of the probability curve at the
lower end is not appropriate. The following example serves to
illustrate the process of applying the chi-square test.

Example 4.2.11: Ascertaining whether non-code compliance
infringements in residences are random or not
A county official was asked to analyze the frequency of cases when
home inspectors found new homes built by one specific builder to be
non-code compliant, and determine whether the violations were random
or not. The following data for 380 homes were collected:

No. of code infringements     0     1     2     3     4
Number of homes             242    94    38     4     2

The underlying random process can be characterized by the Poisson
distribution (see Sect. 2.4.2):
$P(x) = \dfrac{\lambda^x e^{-\lambda}}{x!}$.
The null hypothesis, namely that the sample is drawn from a population
that is Poisson distributed, is to be tested at the 0.05 significance
level.

The sample mean is

$$\bar{x} = \frac{0(242) + 1(94) + 2(38) + 3(4) + 4(2)}{380} = 0.5 \text{ infringements per home}$$

For a Poisson distribution with $\lambda = 0.5$, the underlying or
expected values are found for different values of x as shown in
Table 4.2.

Table 4.2 Expected number of homes for different number of non-code
compliance values if the process is assumed to be a Poisson
distribution with sample mean of 0.5

x = number of non-code     P(x)·n            Expected no.
compliance values
0                          (0.6065)·380      230.470
1                          (0.3033)·380      115.254
2                          (0.0758)·380       28.804
3                          (0.0126)·380        4.788
4                          (0.0016)·380        0.608
5 or more                  (0.0002)·380        0.076
Total                      (1.000)·380       380

The last three categories have expected frequencies that are less
than 5, which does not meet one of the requirements for using the test
(as stated above). Hence, these will be combined into a new category
called "3 or more cases" which will have an expected frequency of
4.788 + 0.608 + 0.076 = 5.472 and an observed frequency of 4 + 2 = 6.
The following statistic is calculated first:

$$\chi^2 = \frac{(242-230.470)^2}{230.470} + \frac{(94-115.254)^2}{115.254} + \frac{(38-28.804)^2}{28.804} + \frac{(6-5.472)^2}{5.472} = 7.483$$

Since there are only 4 groups, the degrees of freedom are
$\nu = 4 - 1 = 3$, and from Table A5, the critical value at the 0.05
significance level is $\chi^2_{critical} = 7.815$. Hence, the null
hypothesis cannot be rejected at the 0.05 significance level; this
is, however, marginal. ■
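The goodness-of-fit calculation of Example 4.2.11 can be retraced with the short sketch below. The variable names are illustrative assumptions. Because the code keeps the Poisson probabilities unrounded, its statistic comes out close to, but not exactly equal to, the hand-computed value of 7.483. The comparison uses $\nu = k - 1 = 3$ as in the text; some references subtract one further degree of freedom when the Poisson mean is estimated from the same data, which is not done here.

```python
import math

# Observed number of homes for 0, 1, 2, 3 and 4 infringements (Example 4.2.11)
observed = {0: 242, 1: 94, 2: 38, 3: 4, 4: 2}
n = sum(observed.values())
lam = sum(x * count for x, count in observed.items()) / n  # sample mean


def poisson_pmf(x, lam):
    """Poisson probability P(x) = lam^x exp(-lam) / x!"""
    return lam ** x * math.exp(-lam) / math.factorial(x)


# Expected counts for x = 0, 1, 2; all remaining probability is pooled
# into a single "3 or more" class so that no expected count is below 5
expected = [n * poisson_pmf(x, lam) for x in range(3)]
expected.append(n - sum(expected))
obs = [observed[0], observed[1], observed[2], observed[3] + observed[4]]

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, expected))

CHI2_CRITICAL = 7.815  # Table A5 value for 3 d.f. at alpha = 0.05, as quoted in the text
print(f"lambda = {lam}, chi2 = {chi2:.3f}, reject H0: {chi2 > CHI2_CRITICAL}")
```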

Example 4.2.12: Evaluating whether injuries in males and females
are independent of circumstance
(From Weiss (1987) by © permission of Pearson Education.)

Chi-square tests are also widely used as tests of independence using
contingency tables. In 1975, more than 59 million Americans suffered
injuries. More males (33.6 million) were injured than females (25.6
million). These statistics do not distinguish whether males and
females tend to be injured in similar circumstances. To study this
issue in a large city, a sample of n = 183 accident reports was
selected at random, as summarized in Table 4.3.

Table 4.3 Observed and computed (assuming gender independence) number
of accidents in different circumstances

                 Male                  Female                Total
Circumstance     Observed  Expected    Observed  Expected    Observed
At work             40       26.3          5       18.7         45
At home             49       62.6         58       44.4        107
Other               18       18.1         13       12.9         31
Total              107                    76                   183 = n

The null hypothesis is that the circumstance of an accident (whether
at work or at home) is independent of the gender of the victim. It is
decided to check this hypothesis at a significance level of
$\alpha = 0.01$. The degrees of freedom are d.f. = (r-1)(c-1) where r
is the number of rows and c the number of columns. Hence,
d.f. = (3-1)(2-1) = 2. From Table A5, the critical value is
$\chi^2_c = 9.21$ at $\alpha = 0.01$ for d.f. = 2.

The expected values for different joint occurrences (male/work,
male/home, male/other, female/work, female/home, female/other) are
listed in the "Expected" columns of the table and correspond to the
case when the occurrences are really independent. Recall from basic
probability (Eq. 2.10) that if events A and B are independent, then
$p(A \cap B) = p(A) \cdot p(B)$ where p indicates the probability. In
our case, if being male and being involved in an accident at work were
truly independent, then
$p(work \cap male) = p(work) \cdot p(male)$. Consider the cell
corresponding to male/at work. Its expected value is

$$n \cdot p(work \cap male) = n \cdot p(work) \cdot p(male) = 183 \cdot \frac{45}{183} \cdot \frac{107}{183} = \frac{(45)(107)}{183} = 26.3$$

(as shown in the table). Expected values for other joint occurrences
shown in the table have been computed in like manner.

Thus, the chi-square statistic is

$$\chi^2 = \frac{(40-26.3)^2}{26.3} + \frac{(5-18.7)^2}{18.7} + \cdots + \frac{(13-12.9)^2}{12.9} = 24.3$$

Since $\chi^2_c < 24.3$, the null hypothesis can be safely rejected at
a significance level of 0.01. Hence, gender does have a bearing on the
circumstance in which the accidents occur. ■
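The expected counts and the test statistic of this contingency-table example can be reproduced with the sketch below. The dictionary layout and names are assumptions made for illustration. Since the expected values are not rounded to one decimal as in Table 4.3, the statistic differs slightly from the 24.3 obtained by hand.

```python
# Observed accident counts (male, female) by circumstance, from Table 4.3
table = {"work": (40, 5), "home": (49, 58), "other": (18, 13)}

row_totals = {name: sum(counts) for name, counts in table.items()}
col_totals = [sum(counts[j] for counts in table.values()) for j in range(2)]
n = sum(row_totals.values())

expected = {}
chi2 = 0.0
for name, counts in table.items():
    # Under independence: expected = n * p(row) * p(column) = row total * column total / n
    expected[name] = tuple(row_totals[name] * c / n for c in col_totals)
    chi2 += sum((o - e) ** 2 / e for o, e in zip(counts, expected[name]))

dof = (len(table) - 1) * (len(col_totals) - 1)
CHI2_CRITICAL = 9.21  # Table A5 value for 2 d.f. at alpha = 0.01, as quoted in the text

print(f"chi2 = {chi2:.2f} with {dof} d.f.; reject independence: {chi2 > CHI2_CRITICAL}")
```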



4.2.7 Test on the Pearson Correlation Coefficient

Recall that the Pearson correlation coefficient was presented in
Sect. 3.4.2 as a means of quantifying the linear relationship between
samples of two variables. One can also define a population
correlation coefficient $\rho$ for two variables. Section 4.2.1
presented methods by which the uncertainty around the population mean
could be ascertained from the sample mean by determining confidence
limits. Similarly, one can make inferences about the population
correlation coefficient $\rho$ from knowledge of the sample
correlation coefficient r. Provided both the variables are normally
distributed (called a bivariate normal population), Fig. 4.8 provides
a convenient way of ascertaining the 95% CL of the population
correlation coefficient for different sample sizes. Say r = 0.6 for a
sample of n = 10 pairs of observations; then the 95% CL for the
population correlation coefficient are ($-0.05 < \rho < 0.87$), which
are very wide. Notice how increasing the sample size shrinks these
bounds. For n = 100, the intervals are ($0.47 < \rho < 0.71$).

Fig. 4.8 Plot depicting 95% confidence bands for population
correlation in a bivariate normal population for various sample sizes
n. The bold vertical line defines the lower and upper limits of
$\rho$ when r = 0.6 from a data set of 10 pairs of observations. (From
Wonnacott and Wonnacott (1985) by permission of John Wiley and Sons)

Table A7 lists the critical values of the sample correlation
coefficient r for testing whether the population correlation
coefficient is statistically significant (i.e., $\rho \neq 0$, against
the null hypothesis $\rho = 0$) at the 0.05 and 0.01 significance
levels for one and two tailed tests. The interpretation of these
values is of some importance in many cases, especially when dealing
with small data sets. Say, analysis of the 12 monthly bills of a
residence revealed a linear correlation of r = 0.6 with degree-days at
the location. Assume that a one-tailed test applies. The sample
correlation suggests the presence of a correlation at a significance
level $\alpha = 0.05$ (the critical value from Table A7 is
$\rho_c = 0.497$) while none at $\alpha = 0.01$ (for which
$\rho_c = 0.658$). Whether observed sample correlations are
significant or not can be evaluated statistically as illustrated
above. Note that certain simplified suggestions on interpreting
values of r in terms of whether they are strong, moderate or weak were
given by Eq. 3.11; these are to be used with caution and were meant as
thumb-rules only.
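When a chart such as Fig. 4.8 is not at hand, approximate confidence limits for $\rho$ can be computed numerically. The sketch below uses the Fisher z-transformation, a common large-sample approximation that is assumed here for illustration; it is not necessarily the method used to construct Fig. 4.8, so the limits it returns are close to, but not identical with, those read off the chart. The function name is hypothetical.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence limits for rho via the Fisher z-transformation.

    z = atanh(r) is treated as normal with standard error 1 / sqrt(n - 3);
    z_crit = 1.96 corresponds to two-sided 95% limits.
    """
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


for n in (10, 100):
    low, high = fisher_ci(0.6, n)
    print(f"n = {n:3d}: {low:.2f} < rho < {high:.2f}")
```

As in the text, the interval for n = 10 straddles zero while the one for n = 100 is much narrower.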



4.3 ANOVA Test for Multi-Samples 

The statistical methods known as ANOVA (analysis of vari- 
ance) are a broad set of widely used and powerful techniques 
meant to identify and measure sources of variation within 
a data set. This is done by partitioning the total variation in 
the data into its component parts. Specifically, ANOVA uses 
variance information from several samples in order to make 
inferences about the means of the populations from which 
these samples were drawn (and, hence, the appellation). Re- 
call that z-tests and t-tests described previously are used to 
test for differences in one random variable (namely, their 
mean values) between two independent groups. This random 
experimental variable is called a factor in designed experi- 
ments and hypothesis testing. It is obvious that several of 
the cases treated in Sect. 4.2 involve single-factor hypothesis 
tests. ANOVA is an extension of such tests to multiple fac- 
tors or experimental variables; even more generally, multiple 
ANOVA (called MANOVA) analysis can be used to test for 
multiple factor differences of multiple groups. Thus, ANOVA
allows one to test whether the mean values of sampled data
taken from different groups are essentially equal or not,
i.e., whether the samples emanate from different populations
or whether they are from the same population.

This section deals with single factor (or single variable) 
ANOVA methods since they are a logical lead-in to multi- 
variate techniques (discussed in Sect. 4.4) as well as experi- 
mental design methods involving several variables which are 
discussed at more length in Chap. 6. 



4.3.1 Single-Factor ANOVA 

The ANOVA procedure uses just one test for comparing k sample means,
just like that followed by the two-sample test. The following example
allows a conceptual understanding of the approach. Say four random
samples have been selected, one from each of four populations. Whether
the sample means differ enough to suggest different parent populations
can be ascertained by comparing the within-sample variation to the
variation between the four samples. The more the sample means differ,
the larger will be the between-samples variation, as shown in
Fig. 4.9b, and the less likely it is that the samples arise from the
same population. The reverse is true if the ratio of between-samples
variation to that of the within-samples is small (Fig. 4.9a).

ANOVA methods test the null hypothesis of the form:

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k \qquad H_a: \text{at least two of the } \mu_i\text{'s are different} \tag{4.18}$$

Adopting the following notation:

Sample sizes: $n_1, n_2, \ldots, n_k$

Sample means: $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k$

Sample standard deviations: $s_1, s_2, \ldots, s_k$

Total sample size: $n = n_1 + n_2 + \cdots + n_k$

Grand average: $\langle x \rangle$ = weighted average of all n responses

Then, one defines between-sample variation called "treatment sum of
squares" (SSTr; see the note below) as:

$$SSTr = \sum_{i=1}^{k} n_i (\bar{x}_i - \langle x \rangle)^2 \quad \text{with d.f.} = k - 1 \tag{4.19}$$

and within-samples variation or "error sum of squares" (SSE) as:

$$SSE = \sum_{i=1}^{k} (n_i - 1)\, s_i^2 \quad \text{with d.f.} = n - k \tag{4.20}$$

Note: The term "treatment" was originally coined for historic reasons
where one was interested in evaluating the effect of treatments or
changes in a product development process. It is now used synonymously
to mean "classes" from which the samples are drawn.

Fig. 4.9 Conceptual explanation of the basis of an ANOVA test

Together these two sources of variation comprise the "total sum of
squares" (SST):

$$SST = SSTr + SSE = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \langle x \rangle)^2 \quad \text{with d.f.} = n - 1 \tag{4.21}$$

SST is simply the sample variance of the combined set of n data
points multiplied by (n - 1), i.e., $SST = (n - 1)\, s^2$ where s is
the standard deviation of all the n data points.

The statistic defined below as the ratio of two variances is said to
follow the F-distribution:

$$F = \frac{MSTr}{MSE} \tag{4.22}$$

where MSTr is the mean between-sample variation

$$MSTr = SSTr/(k - 1) \tag{4.23}$$

and MSE is the mean within-sample variation (mean error sum of
squares)

$$MSE = SSE/(n - k) \tag{4.24}$$

Recall that the p-value is the area of the F curve for (k-1, n-k)
degrees of freedom to the right of the F value. If p-value
$< \alpha$ (the selected significance level), then the null
hypothesis can be rejected. Note that the test is meant to be used for
normal populations and equal population variances.

Example 4.3.1: Comparing mean life of five motor bearings
(From Devore and Farnum (2005) by © permission of Cengage Learning.)

A motor manufacturer wishes to evaluate five different motor bearings
for motor vibration (which results in reduced life). Each type of
bearing is installed on different random samples of six motors. The
amount of vibration (in microns) is recorded when each of the 30
motors is running. The data obtained is assembled in Table 4.4.

Table 4.4 Vibration values (in microns) for five brands of bearings
tested on six motor samples (Example 4.3.1)

Sample      Brand 1   Brand 2   Brand 3   Brand 4   Brand 5
1            13.1      16.3      13.7      15.7      13.5
2            15.0      15.7      13.9      13.7      13.4
3            14.0      17.2      12.4      14.4      13.2
4            14.4      14.9      13.8      16.0      12.7
5            14.0      14.4      14.9      13.9      13.4
6            11.6      17.2      13.3      14.7      12.3
Mean         13.68     15.95     13.67     14.73     13.08
Std. dev.    1.194     1.167     0.816     0.940     0.479

Determine whether the bearing brands have an effect on motor
vibration at the $\alpha = 0.05$ significance level. In this example,
k = 5 and n = 30. The one-way ANOVA table is first generated as shown
in Table 4.5.

Table 4.5 ANOVA table for Example 4.3.1

Source    d.f.         Sum of Squares    Mean Square     F-value
Factor    5-1 = 4      SSTr = 30.855     MSTr = 7.714    8.44
Error     30-5 = 25    SSE = 22.838      MSE = 0.9135
Total     30-1 = 29    SST = 53.694

From the F tables (Table A6) and for $\alpha = 0.05$, the critical F
value for d.f. = (4, 25) is $F_c = 2.76$, which is less than F = 8.44
computed from the data. Hence, one is compelled to reject the null
hypothesis that all five means are equal, and conclude that the type
of motor bearing does have a significant effect on motor vibration.
In fact, this conclusion can be reached even at the more stringent
significance level of $\alpha = 0.001$.

The results of the ANOVA analysis can be conveniently illustrated by
generating an effects plot, as shown in Fig. 4.10a. This illustrates
clearly the relationship between the mean values of the response
variable, i.e., vibration level for the five different motor bearing
brands. Brand 5 gives the lowest average vibration, while Brand 2 has
the highest. Note that such plots, though providing useful insights,
are not generally a substitute for an ANOVA analysis. Another way of
plotting the data is a means plot (Fig. 4.10b) which includes 95% CL
intervals as well as the information provided in Fig. 4.10a. Thus, a
sense of the variation within samples can be gleaned. ■

Fig. 4.10 a Effect plot, b Means plot showing the 95% CL intervals
around the mean values of the 5 brands (Example 4.3.1)
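The entries of the one-way ANOVA table can be recomputed directly from the raw vibration data using Eqs. 4.19 to 4.24. The sketch below relies only on Python's standard `statistics` module; the variable names are illustrative, and the Brand 1, sample 4 reading is taken as 14.4, the value inferred from the tabulated mean and standard deviation of that brand.

```python
from statistics import mean, variance

# Vibration data (microns) from Table 4.4; the fourth Brand 1 reading (14.4)
# is inferred from the tabulated mean (13.68) and standard deviation (1.194)
brands = {
    1: [13.1, 15.0, 14.0, 14.4, 14.0, 11.6],
    2: [16.3, 15.7, 17.2, 14.9, 14.4, 17.2],
    3: [13.7, 13.9, 12.4, 13.8, 14.9, 13.3],
    4: [15.7, 13.7, 14.4, 16.0, 13.9, 14.7],
    5: [13.5, 13.4, 13.2, 12.7, 13.4, 12.3],
}

k = len(brands)
n = sum(len(values) for values in brands.values())
grand_average = sum(sum(values) for values in brands.values()) / n

# Eq. 4.19: between-sample (treatment) sum of squares
sstr = sum(len(v) * (mean(v) - grand_average) ** 2 for v in brands.values())
# Eq. 4.20: within-sample (error) sum of squares; variance() uses divisor n_i - 1
sse = sum((len(v) - 1) * variance(v) for v in brands.values())

mstr = sstr / (k - 1)   # Eq. 4.23
mse = sse / (n - k)     # Eq. 4.24
F = mstr / mse          # Eq. 4.22

F_CRITICAL = 2.76  # Table A6 value for (4, 25) d.f. at alpha = 0.05, as quoted in the text
print(f"SSTr = {sstr:.3f}, SSE = {sse:.3f}, F = {F:.2f}, reject H0: {F > F_CRITICAL}")
```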



4.3.2 Tukey's Multiple Comparison Test 

A limitation of the ANOVA test is that, in case the null hypothesis
is rejected, one is unable to determine the exact cause. For example,
one poor motor bearing brand could have been the cause of this
rejection in the example above even though the four other brands could
be essentially similar. Thus, one needs to be able to pinpoint the
sample or samples responsible for the rejection of the null
hypothesis. One could, of course, perform paired comparisons of two
brands one at a time. In the case of 5 sets, one would then make 10
such tests. Apart from the tediousness of such a procedure, making
independent paired comparisons leads to a decrease in sensitivity,
i.e., type I errors are magnified. Hence, procedures that allow
multiple comparisons to be made simultaneously have been proposed for
this purpose (see Manly 2005). One such method is discussed in
Sect. 4.4.2.

In this section, Tukey's significant difference procedure based on
paired comparisons is described; it is limited to cases of equal
sample sizes. This procedure allows the simultaneous formation of
prespecified confidence intervals for all paired comparisons using
the studentized range distribution. Separate tests are conducted to
determine whether $\mu_i = \mu_j$ for each pair (i, j) of means in an
ANOVA study of k population means. Tukey's procedure is based on
comparing the distance (or absolute value) between any two sample
means $|\bar{x}_i - \bar{x}_j|$ to a threshold value T that depends on
the significance level $\alpha$ as well as on the mean square error
(MSE) from the ANOVA test. The T value is calculated as:

$$T = q_\alpha \left( \frac{MSE}{n_i} \right)^{1/2} \tag{4.25}$$

where $n_i$ is the size of the sample drawn from each population, and
the $q_\alpha$ values are called the studentized range distribution
values and are given in Table A8 for $\alpha = 0.05$ for
d.f. = (k, n-k).

If $|\bar{x}_i - \bar{x}_j| > T$, then one concludes that
$\mu_i \neq \mu_j$ at the corresponding significance level.
Otherwise, one concludes that there is no difference between the two
means. Tukey also suggested a convenient visual representation to
keep track of the results of all these pairwise tests. Tukey's
procedure and this representation are illustrated in the following
example.

Example 4.3.2: Using the same data as that in Example 4.3.1, conduct
a multiple comparison procedure to distinguish which of the motor
bearing brands are superior to the rest.
(From Devore and Farnum (2005) by © permission of Cengage Learning.)

Following Tukey's procedure given by Eq. 4.25, the critical distance
between sample means at $\alpha = 0.05$ is:

$$T = q_\alpha \left( \frac{MSE}{n_i} \right)^{1/2} = q_\alpha \left( \frac{0.9135}{6} \right)^{1/2} = 1.62$$

where $q_\alpha$ (approximately 4.15, as implied by the values above)
is found by interpolation from Table A8 based on
d.f. = (k, n-k) = (5, 25).

The pairwise distances between the five sample means shown in
Table 4.6 can be determined, and appropriate inferences made.

Table 4.6 Pairwise analysis of the five samples following Tukey's
procedure

Samples    Distance                   Conclusion (a)
1,2        |13.68 - 15.95| = 2.27     $\mu_i \neq \mu_j$
1,3        |13.68 - 13.67| = 0.01
1,4        |13.68 - 14.73| = 1.05
1,5        |13.68 - 13.08| = 0.60
2,3        |15.95 - 13.67| = 2.28     $\mu_i \neq \mu_j$
2,4        |15.95 - 14.73| = 1.22
2,5        |15.95 - 13.08| = 2.87     $\mu_i \neq \mu_j$
3,4        |13.67 - 14.73| = 1.06
3,5        |13.67 - 13.08| = 0.59
4,5        |14.73 - 13.08| = 1.65     $\mu_i \neq \mu_j$

(a) Only if distance > critical value of 1.62

Thus, the distance between the following pairs is less than 1.62:
{1,3; 1,4; 1,5}, {2,4}, {3,4; 3,5}. This information is visually
summarized in Fig. 4.11 by arranging the five sample means in
ascending order and then drawing rows of bars connecting the pairs
whose distances do not exceed T = 1.62. It is now clear that though
brand 5 has the lowest mean value, it is not significantly different
from brands 1 and 3. Hence, the final selection of which motor
bearing to pick can be made from these three brands only. ■

Fig. 4.11 [Sample means in ascending order: Brand 5 (13.08), Brand 3
(13.67), Brand 1 (13.68), Brand 4 (14.73), Brand 2 (15.95), with rows
of bars connecting the pairs that are not significantly different.]
Graphical depiction summarizing the ten pairwise comparisons
following Tukey's procedure. Brand 2 is significantly different from
Brands 1, 3 and 5, and so is Brand 4 from Brand 5 (Example 4.3.2)
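The ten pairwise comparisons of Example 4.3.2 are easy to automate once the threshold T is known. In the sketch below the threshold is simply the value 1.62 quoted in the example (the code does not look up the studentized range table); the implied $q_\alpha$ is back-calculated from it only as a consistency check. All names are illustrative.

```python
import math
from itertools import combinations

# Sample means of the five brands (Table 4.4), MSE from Table 4.5, six motors per brand
means = {1: 13.68, 2: 15.95, 3: 13.67, 4: 14.73, 5: 13.08}
MSE, n_i = 0.9135, 6

T = 1.62  # critical distance quoted in Example 4.3.2 (alpha = 0.05)
q_implied = T / math.sqrt(MSE / n_i)  # studentized range value implied by Eq. 4.25

# A pair of brands is declared different only if its distance exceeds T
different = [(i, j) for i, j in combinations(sorted(means), 2)
             if abs(means[i] - means[j]) > T]
similar = [(i, j) for i, j in combinations(sorted(means), 2)
           if (i, j) not in different]

print(f"implied q = {q_implied:.2f}")
print("significantly different pairs:", different)
print("pairs not significantly different:", similar)
```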



4.4 Tests of Significance of Multivariate Data 

4.4.1 Introduction to Multivariate Methods 

Multivariate analysis (also called multifactor analysis) is the 
branch of statistics that deals with statistical inference and 
model building as applied to multiple measurements made 
from one or several samples taken from one or several popu- 
lations. Multivariate methods can be used to make inferen- 
ces about sample means and variances. Rather than treating 
each measure separately as done in t-tests and single-factor 
ANOVA, multivariate inferential methods allow the analyses 
of multiple measures simultaneously as a system of measurements.
This generally results in sounder inferences being made, a point
elaborated below.

The univariate probability distributions presented in Sect. 2.4 can
also be extended to bivariate and multivariate distributions. Let
$x_1$ and $x_2$ be two variables of the same type, say both discrete
(the summations in the equations below need to be replaced with
integrals for continuous variables). Their joint distribution
$f(x_1, x_2)$ satisfies:

$$f(x_1, x_2) \geq 0 \quad \text{and} \quad \sum_{\text{all } (x_1, x_2)} f(x_1, x_2) = 1 \tag{4.26}$$

Consider two sets of multivariate data each consisting of p
variables. However, they could be different in size, i.e., the number
of observations in each set may be different, say $n_1$ and $n_2$. Let
$\bar{X}_1$ and $\bar{X}_2$ be the sample mean vectors of dimension p.
For example,

$$\bar{X}_1 = [\bar{x}_{11}, \bar{x}_{12}, \ldots, \bar{x}_{1i}, \ldots, \bar{x}_{1p}] \tag{4.27}$$

where $\bar{x}_{1i}$ is the sample average over $n_1$ observations of
parameter i for the first set.

Further, let $C_1$ and $C_2$ be the sample covariance matrices of
size (p x p) for the two sets respectively (the basic concepts of
covariance and correlation were presented in Sect. 3.4.2). Then, the
sample matrix of variances and covariances for the first data set is
given by:

$$C_1 = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1p} \\ c_{21} & c_{22} & \cdots & c_{2p} \\ \vdots & \vdots & & \vdots \\ c_{p1} & c_{p2} & \cdots & c_{pp} \end{bmatrix} \tag{4.28}$$

where $c_{ii}$ is the variance for parameter i and $c_{ik}$ the
covariance for parameters i and k.

Similarly, the sample correlation matrix, where the diagonal elements
are equal to unity and the other terms are scaled appropriately, is
given by

$$R_1 = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix} \tag{4.29}$$

Both matrices describe the association between each pair of
variables, and they are symmetric about the diagonal since, say,
$c_{12} = c_{21}$ and so on. This redundancy is simply meant to allow
easier reading. These matrices provide a convenient visual
representation of the extent to which the different sets of variables
are correlated with each other, thereby allowing strongly correlated
sets to be easily identified. Note that correlations are not affected
by shifting and scaling the data. Thus, standardizing the variables
by subtracting the mean from each observation and dividing by the
standard deviation will still retain the correlation structure of the
original data set while providing certain convenient interpretations
of the results.
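The relation between Eqs. 4.28 and 4.29, and the statement that standardizing leaves the correlation structure unchanged, can be checked numerically. The sketch below uses a small hypothetical data set (three parameters, five observations, not taken from the text) and plain Python lists; the function names are likewise assumptions.

```python
import math


def covariance_matrix(columns):
    """Sample covariance matrix (divisor n - 1) of a list of equal-length columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    return [[sum((a - mi) * (b - mk) for a, b in zip(ci, ck)) / (n - 1)
             for ck, mk in zip(columns, means)]
            for ci, mi in zip(columns, means)]


def correlation_matrix(C):
    """Scale a covariance matrix so that its diagonal elements equal unity."""
    sd = [math.sqrt(C[i][i]) for i in range(len(C))]
    return [[C[i][k] / (sd[i] * sd[k]) for k in range(len(C))]
            for i in range(len(C))]


# Hypothetical data: three parameters, five observations each
data = [[1.0, 2.0, 3.0, 4.0, 5.0],
        [2.1, 3.9, 6.2, 8.1, 9.8],
        [5.0, 3.0, 4.0, 1.0, 2.0]]

C = covariance_matrix(data)
R = correlation_matrix(C)

# Standardize each parameter: subtract its mean, divide by its standard deviation
standardized = [[(v - sum(col) / len(col)) / math.sqrt(C[i][i]) for v in col]
                for i, col in enumerate(data)]
R_standardized = correlation_matrix(covariance_matrix(standardized))

for row in R:
    print([round(value, 3) for value in row])
```

For standardized data the covariance matrix and the correlation matrix coincide, which is one of the convenient interpretations alluded to above.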

Underlying assumptions for multivariate tests of significance include
the requirement that the two samples have close to multivariate
normal distributions with equal population covariance matrices. The
multivariate normal distribution is a generalization of the
univariate normal distribution when $p \geq 2$, where p is the number
of dimensions or parameters. Figure 4.12 illustrates how the
bivariate normal distribution is distorted in the presence of
correlated variables. The contour lines are circles for uncorrelated
variables and ellipses for correlated ones.

Fig. 4.12 Two bivariate normal distributions and associated 50% and
90% contours assuming equal standard deviations for both variables.
However, the left hand side plots presume the two variables to be
uncorrelated, while those on the right have a correlation coefficient
of 0.75 which results in elliptical contours. (From Johnson and
Wichern (1988) by © permission of Pearson Education)




4.4.2 Hotelling T² Test

The simplest extension of univariate statistical tests is the
situation when two or more samples are evaluated to determine whether
they originate from populations with: (i) different means and (ii)
different variances/covariances. One can distinguish between the
following types of multivariate inference tests involving more than
one parameter (Manly 2005):

(a) comparison of mean values for two samples is best done using the
Hotelling $T^2$-test;

(b) comparison of variation for two samples (several procedures have
been proposed; the best known are the Box's M-test, the Levene's test
based on the $T^2$-test, and the Van Valen test);

(c) comparison of mean values for several samples (several tests are
available; the best known are the Wilks' lambda statistic test, Roy's
largest root test, and Pillai's trace statistic test);

(d) comparison of variation for several samples (using the Box's
M-test).

Only case (a) will be described here, while the others are treated in
texts such as Manly (2005). Consider two samples with sample sizes
$n_1$ and $n_2$. One wishes to compare differences in p random
variables among the two samples. Let $\bar{X}_1$ and $\bar{X}_2$ be
the mean vectors of the two samples. A pooled estimate of the
covariance matrix is:

$$C = \{(n_1 - 1) C_1 + (n_2 - 1) C_2\}/(n_1 + n_2 - 2) \tag{4.30}$$

where $C_1$ and $C_2$ are the covariance matrices given by Eq. 4.28.

Then, the Hotelling $T^2$-statistic is defined as:

$$T^2 = \frac{n_1 n_2\, (\bar{X}_1 - \bar{X}_2)'\, C^{-1}\, (\bar{X}_1 - \bar{X}_2)}{(n_1 + n_2)} \tag{4.31}$$

A large numerical value of this statistic suggests that the two
population mean vectors are different. The null hypothesis test uses
the transformed statistic:

$$F = \frac{(n_1 + n_2 - p - 1)\, T^2}{(n_1 + n_2 - 2)\, p} \tag{4.32}$$

which follows the F-distribution with p and $(n_1 + n_2 - p - 1)$
degrees of freedom.

Since the $T^2$ statistic is quadratic, it can also be written in
double sum notation as:

$$T^2 = \frac{n_1 n_2}{(n_1 + n_2)} \sum_{i=1}^{p} \sum_{k=1}^{p} (\bar{x}_{1i} - \bar{x}_{2i})\, c^{ik}\, (\bar{x}_{1k} - \bar{x}_{2k}) \tag{4.33}$$

where $c^{ik}$ denotes the element in row i and column k of the
inverse matrix $C^{-1}$.

Example 4.4.1: Comparing mean values of two samples by pairwise and
by Hotelling $T^2$ procedures
(From Manly (2005) by © permission of CRC Press.)

Consider two samples with 5 parameters (p = 5). Sample 1 has 21
observations and sample 2 has 28. The mean vectors and covariance
matrices of both these samples have been calculated and are shown
below:

$$\bar{X}_1 = \begin{bmatrix} 157.381 \\ 241.000 \\ 31.433 \\ 18.500 \\ 20.810 \end{bmatrix} \quad \text{and} \quad C_1 = \begin{bmatrix} 11.048 & 9.100 & 1.557 & 0.870 & 1.286 \\ 9.100 & 17.500 & 1.910 & 1.310 & 0.880 \\ 1.557 & 1.910 & 0.531 & 0.189 & 0.240 \\ 0.870 & 1.310 & 0.189 & 0.176 & 0.133 \\ 1.286 & 0.880 & 0.240 & 0.133 & 0.575 \end{bmatrix}$$

$$\bar{X}_2 = \begin{bmatrix} 158.429 \\ 241.571 \\ 31.479 \\ 18.446 \\ 20.839 \end{bmatrix} \quad \text{and} \quad C_2 = \begin{bmatrix} 15.069 & 17.190 & 2.243 & 1.746 & 2.931 \\ 17.190 & 32.550 & 3.398 & 2.950 & 4.066 \\ 2.243 & 3.398 & 0.728 & 0.470 & 0.559 \\ 1.746 & 2.950 & 0.470 & 0.434 & 0.506 \\ 2.931 & 4.066 & 0.559 & 0.506 & 1.321 \end{bmatrix}$$

If one performed pooled-variance two-sample t-tests with each
parameter taken one at a time (as described in Sect. 4.2.3), one
would compute the pooled variance for the first parameter as:

$$s^2 = [(21 - 1)(11.048) + (28 - 1)(15.069)]/(21 + 28 - 2) = 13.36$$

and the t-statistic as:

$$t = \frac{(157.381 - 158.429)}{\left[ 13.36 \left( \dfrac{1}{21} + \dfrac{1}{28} \right) \right]^{1/2}} = -0.99$$

with 47 degrees of freedom. This is not significantly different from
zero, as one can note from the p-value indicated in Table A4.
Table 4.7 assembles similar results for all other parameters. One
would conclude that none of the five parameters in both data sets are
statistically different.
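The five univariate t-statistics can be recomputed from the sample means and the diagonal elements (variances) of the two covariance matrices. The sketch below assumes the pooled-variance form used above for the first parameter and applies it to all five; the helper name is hypothetical.

```python
import math

n1, n2 = 21, 28

# (mean, variance) of each parameter: means from the two mean vectors,
# variances from the diagonals of C1 and C2 in Example 4.4.1
set1 = [(157.381, 11.048), (241.000, 17.500), (31.433, 0.531),
        (18.500, 0.176), (20.810, 0.575)]
set2 = [(158.429, 15.069), (241.571, 32.550), (31.479, 0.728),
        (18.446, 0.434), (20.839, 1.321)]


def pooled_t(m1, v1, m2, v2):
    """Two-sample t-statistic using the pooled variance of both samples."""
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


t_values = [pooled_t(m1, v1, m2, v2) for (m1, v1), (m2, v2) in zip(set1, set2)]

for parameter, t in enumerate(t_values, start=1):
    print(f"parameter {parameter}: t = {t:.2f} with {n1 + n2 - 2} d.f.")
```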

In order to perform the multivariate test, one first calcula- 
tes the pooled sample covariance matrix (Eq. 4.30): 

Table 4.7 Paired t-tests for each of the five parameters talcen one at 
a time 



Para- 
meter 


First data set 


Second data set 


t-value 
(47 d.f.) 


p-value 




Mean 


Vaiiance 


Mean 


Vaiiance 




1 


157.38 


11.05 


158.43 


15.07 


-0.99 


0.327 


2 


241.00 


17.50 


241.57 


32.55 


-0.39 


0.698 


3 


31.43 


0.53 


31.48 


0.73 


-0.20 


0.842 


4 


18.50 


0.18 


18.45 


0.43 


0.33 


0.743 


5 


20.81 


0.58 


20.84 


1.32 


-0.10 


0.921 



20C1 



27C2 \ 



47 


; 








13.358 


13.748 


1.951 


1.373 


2.231 


13.748 


26.146 


2.765 


2.252 


2.710 


1.951 


2.765 


0.645 


0.350 


0.423 


1.373 


2.252 


0.350 


0.324 


0.347 


2.231 


2.710 


0.423 


0.347 


1.004 



where, for example, the first entry is: (20 xl 1.048 H- 
27 xl5.069)/47 = 13.358. 

The inverse of the matrix C yields

$$ C^{-1} =
\begin{bmatrix}
 0.2061 & -0.0694 & -0.2395 &  0.0785 & -0.1969 \\
-0.0694 &  0.1234 & -0.0376 & -0.5517 &  0.0277 \\
-0.2395 & -0.0376 &  4.2219 & -3.2624 & -0.0181 \\
 0.0785 & -0.5517 & -3.2624 & 11.4610 & -1.2720 \\
-0.1969 &  0.0277 & -0.0181 & -1.2720 &  1.8068
\end{bmatrix} $$

Substituting the elements of the above matrix in Eq. 4.33 results in:

$$ T^2 = \frac{(21)(28)}{21+28}\,\Big[(157.381 - 158.429)(0.2061)(157.381 - 158.429) - (157.381 - 158.429)(0.0694)(241.000 - 241.571) + \cdots + (20.810 - 20.839)(1.8068)(20.810 - 20.839)\Big] = 2.824 $$

which from Eq. 4.32 results in an F-statistic:

$$ F = \frac{(21+28-5-1)(2.824)}{(21+28-2)(5)} = 0.517 \quad \text{with 5 and 43 d.f.} $$

This is clearly not significant since $F_{critical} = 2.4$ (from Table A6), and so there is no evidence to support that the population means of the two groups are statistically different when all five parameters are simultaneously considered. In this case one could have drawn such a conclusion directly from Table 4.7 by looking at the individual p-values, but this may not always happen. ■
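The multivariate calculation can be sketched in a few lines. The code below assumes the forms of Eq. 4.33 and Eq. 4.32 implied by the worked example, namely $T^2 = \frac{n_1 n_2}{n_1+n_2}\, d^T C^{-1} d$ (with d the vector of mean differences) and $F = \frac{(n_1+n_2-p-1)\,T^2}{(n_1+n_2-2)\,p}$. The function names are illustrative, and the small linear solver is included only to keep the sketch free of external libraries.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def hotelling_t2(mean_diff, pooled_cov, n1, n2):
    """T^2 = n1*n2/(n1+n2) * d' C^-1 d for two samples."""
    w = solve_linear(pooled_cov, mean_diff)              # w = C^-1 d
    quad = sum(d * wi for d, wi in zip(mean_diff, w))    # d' C^-1 d
    return n1 * n2 / (n1 + n2) * quad


def t2_to_f(t2, n1, n2, p):
    """Convert T^2 to an F-statistic with (p, n1+n2-p-1) degrees of freedom."""
    f = (n1 + n2 - p - 1) * t2 / ((n1 + n2 - 2) * p)
    return f, (p, n1 + n2 - p - 1)


# Toy check with an identity covariance matrix (hypothetical numbers)
toy_t2 = hotelling_t2([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], 2, 2)

# F-statistic for the T^2 value reported in the example
f_stat, f_dof = t2_to_f(2.824, 21, 28, 5)
print(f"F = {f_stat:.3f} with {f_dof[0]} and {f_dof[1]} d.f.")
```

Feeding the rounded means of Table 4.7 and the rounded pooled matrix into `hotelling_t2` should give a value close to, though not necessarily identical with, the reported 2.824, since the tabulated inputs have been rounded.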

Other than the elegance provided, there are two distinct advantages of performing a single multivariate test as against a series of univariate tests. First, the probability of finding a type-I result purely by accident increases as the number of variables increases, which a single test avoids. Second, the multivariate test takes proper account of the correlation between variables. The above example illustrated the case where no significant differences in population means could be discerned either from univariate tests performed individually or from an overall multivariate test. However, there are instances when the latter test turns out to be significant as a result of the cumulative effects of all parameters while no single parameter is significantly different. The converse may also hold: the evidence provided by one significantly different parameter may be swamped by lack of differences between the other parameters. Hence, it is advisable to perform tests as illustrated in the above example.

Fig. 4.13 Overview of various types of parametric hypothesis tests treated in this chapter along with section numbers. The lower set of three sections treats non-parametric tests.
[Figure: tree diagram of hypothesis tests. One sample, one variable: mean/proportion (Sect. 4.2.2, 4.2.4(a)), variance (4.2.5(a)), probability distribution (4.2.6). One sample, two variables: correlation coefficient (4.2.7; non-parametric 4.5.1). Two samples, one variable: mean/proportion (4.2.3(a), 4.2.3(b), 4.2.4(b); non-parametric 4.5.2), variance (4.2.5(b)). Two samples, multivariate: mean (4.4.2, Hotelling T²). Multi samples, one variable: mean (4.3, ANOVA; non-parametric 4.5.3).]

The above sections (Sect. 4.2-4.4) treated several cases of hypothesis testing. An overview of these cases is provided in Fig. 4.13 for greater clarity. The specific sub-section of each of the cases is also indicated. The ANOVA case corresponds to the lower right box, namely testing for differences in the means of a single factor or variable which is sampled from several populations, while the Hotelling T²-test corresponds to the case when the means of several variables from two samples are evaluated. As noted above, formal use of statistical methods can become very demanding mathematically and computationally when multivariate and multi-sample cases are considered, and hence the advantage of using numerically based resampling methods (discussed in Sect. 4.8).



4.5 Non-parametric Methods 

The parametric tests described above have implicit built-in assumptions regarding the distributions from which the samples are taken. Comparison of populations using the t-test and F-test can yield misleading results when the random variables being measured are not normally distributed or do not have equal variances. It is obvious that the fewer the assumptions, the broader the potential applications of the test. One would like the significance tests used to lead to sound conclusions, or the risk of coming to wrong conclusions to be minimized. Two concepts relate to the latter aspect. The robustness of a test is inversely related to its sensitivity to violations of the underlying assumptions. The power of a test, on the other hand, is the probability of correctly rejecting a false null hypothesis; a more powerful test needs fewer observations to reach a given conclusion, and so is a measure of the extent to which the cost of experimentation is reduced.

There are instances when the random variables are not quantifiable measurements but can only be ranked in order of magnitude. For example, a consumer survey respondent may rate one product as better than another but is unable to assign quantitative values to each product. Data involving such "preferences" also cannot be subjected to the t and F tests. It is in such cases that one has to resort to nonparametric statistics. Rather than use actual numbers, nonparametric tests usually use relative ranks by sorting the data by rank (or magnitude), and discarding their specific numerical values. Nonparametric tests are generally less powerful than parametric ones, but on the other hand, are more robust and less sensitive to outlier points (much of the material that follows is drawn from McClave and Benson 1988).



4.5.1 Test on Spearman Rank Correlation 
Coefficient 

The Pearson correlation coefficient (Sect. 3.4.2) was a parametric measure meant to quantify the correlation between two quantifiable variables. The Spearman rank correlation coefficient $r_s$ is exactly similar but applies to relative ranks. The same equation as Eq. 3.10 can be used to compute this measure, with its magnitude and sign interpreted in the same fashion. However, a simpler formula is often used:

$$ r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} \tag{4.34} $$

where n is the number of paired measurements, and the difference between the ranks for the ith measurement for ranked variables u and v is $d_i = u_i - v_i$.

Table 4.8 Data table for Example 4.5.1 showing how to conduct the non-parametric correlation test

Faculty   Research      Teaching     Research      Teaching      Difference   Diff squared
          grants ($)    evaluation   rank (u_i)    rank (v_i)    d_i          d_i^2
1         1,480,000     7.05          5             7            -2            4
2           890,000     7.87          1             8            -7           49
3         3,360,000     3.90         10             2             8           64
4         2,210,000     5.41          8             5             3            9
5         1,820,000     9.02          7             9            -2            4
6         1,370,000     6.07          4             6            -2            4
7         3,180,000     3.20          9             1             8           64
8           930,000     5.25          2             4            -2            4
9         1,270,000     9.50          3            10            -7           49
10        1,610,000     4.45          6             3             3            9
Total                                                                        260

Example 4.5.1: Non-parametric testing of correlation between the sizes of faculty research grants and teaching evaluations

The provost of a major university wants to determine whether a statistically significant correlation exists between the research grants and teaching evaluation rating of its senior faculty. Data over three years has been collected as assembled in Table 4.8, which also shows the manner in which ranks have been generated and the quantities $d_i = u_i - v_i$ computed.

Using Eq. 4.34 with n = 10:

$$ r_s = 1 - \frac{6(260)}{10(100 - 1)} = -0.576 $$
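The rank generation and Eq. 4.34 can be reproduced with a short script. This is a minimal sketch using the data of Table 4.8; it assumes there are no tied values (true for this data set), and the helper names `ranks_no_ties` and `spearman_rs` are illustrative.

```python
# Data from Table 4.8
grants = [1_480_000, 890_000, 3_360_000, 2_210_000, 1_820_000,
          1_370_000, 3_180_000, 930_000, 1_270_000, 1_610_000]
teaching = [7.05, 7.87, 3.90, 5.41, 9.02, 6.07, 3.20, 5.25, 9.50, 4.45]


def ranks_no_ties(values):
    """Rank from 1 (smallest) to n; valid only when all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(x, y):
    """Spearman rank correlation by Eq. 4.34, plus the sum of squared rank differences."""
    u, v = ranks_no_ties(x), ranks_no_ties(y)
    n = len(x)
    sum_d2 = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2


r_s, sum_d2 = spearman_rs(grants, teaching)
print(f"sum of d^2 = {sum_d2}, r_s = {r_s:.3f}")
```

When ties are present, average ranks would have to be assigned and Eq. 4.34 becomes only approximate.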



Thus, one notes that there exists a negative correlation between the sample data. However, whether this is significant for the population correlation coefficient $\rho_s$ can be ascertained by means of a statistical test:

$H_0: \rho_s = 0$ (there is no significant correlation)
$H_a: \rho_s \ne 0$ (there is significant correlation)

Table A10 in Appendix A gives the absolute cutoff values for different significance levels. For n = 10, the critical value for $\alpha = 0.05$ is 0.564, which suggests that the correlation can be deemed to be significant at the 0.05 significance level, but not at the 0.025 level whose critical value is 0.648. ■



4.5.2 Wilcoxon Rank Tests: Two Sample and Paired Tests

Rather than compare specific parameters (such as the mean and the variance), the non-parametric tests evaluate whether the probability distributions of the sampled populations are different or not. The test is nonparametric and no restriction is placed on the distribution other than that it needs to be continuous and symmetric.

(a) The Wilcoxon rank sum test is meant for independent samples where the individual observations can be ranked by magnitude. The following example illustrates the approach.

Example 4.5.2: Ascertaining whether oil company researchers and academics differ in their predictions of future atmospheric carbon dioxide levels

The intent is to compare the predictions in the change of atmospheric carbon dioxide levels between researchers who are employed by oil companies and those who are in academia. The gathered data, shown in Table 4.9, are the predicted percentage increases in carbon dioxide from the current level over the next 10 years from six oil company researchers and seven academics. Perform a statistical test at the 0.05 significance level in order to evaluate the following hypotheses:

(a) Predictions made by oil company researchers differ from those made by academics.

(b) Predictions made by oil company researchers tend to be lower than those made by academics.


Table 4.9 Wilcoxon rank test calculation for paired independent samples (Example 4.5.2)

       Oil company researchers      Academics
       Prediction (%)   Rank        Prediction (%)   Rank
1      3.5               4          4.7               6
2      5.2               7          5.8               9
3      2.5               2          3.6               5
4      5.6               8          6.2              11
5      2.0               1          6.1              10
6      3.0               3          6.3              12
7      -                 -          6.5              13
Sum                     25                           66

(a) First, ranks are assigned as shown for the two groups of individuals combined. Since there are 13 predictions, the ranks run from 1 through 13 as shown in the table. The test statistic is based on the sum totals of each group (and hence its name). If they are close, the implication is that there is no evidence that the probability distributions of both groups are different; and vice versa.

Let $T_A$ and $T_B$ be the rank sums of either group. Then

$$ T_A + T_B = \frac{n(n+1)}{2} = \frac{13(13+1)}{2} = 91 \tag{4.35} $$



where $n = n_1 + n_2$ with $n_1 = 6$ and $n_2 = 7$. Note that $n_1$ should be selected as the one with fewer observations. A small value of $T_A$ implies a large value of $T_B$, and vice versa. Hence, the greater the difference between the two rank sums, the greater the evidence that the samples come from different populations. Since one is testing whether the predictions by both groups are different or not, the two-tailed significance test is appropriate. Table A11 provides the lower and upper cutoff values for different values of $n_1$ and $n_2$ for both the one-tailed and the two-tailed tests. Note that the lower and higher cutoff values are (28, 56) at 0.05 significance level for the two-tailed test. Since the computed statistics of $T_A = 25$ and $T_B = 66$ are outside this range, the null hypothesis is rejected, and one would conclude that the predictions from the two groups are different.

(b) Here one wishes to test the hypothesis that the predictions by oil company researchers are lower than those made by academics. Then, one uses a one-tailed test whose cutoff values are given in part (b) of Table A11. These cutoff values at 0.05 significance level are (30, 54), but only the lower value of 30 is used in this case. The null hypothesis will be rejected only if $T_A < 30$. Since this is so, the above data suggests that the null hypothesis can be rejected at a significance level of 0.05. ■
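The rank sums of Example 4.5.2 can be verified as follows. This is a minimal sketch: the helper `average_ranks` (an illustrative name) assigns mean ranks to tied values, and the cutoff values (28, 56) and 30 are simply those quoted in the text from Table A11 rather than computed.

```python
def average_ranks(values):
    """Rank from 1 (smallest) upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


# Data from Table 4.9 (predicted % increase in carbon dioxide)
oil = [3.5, 5.2, 2.5, 5.6, 2.0, 3.0]
academics = [4.7, 5.8, 3.6, 6.2, 6.1, 6.3, 6.5]

combined = average_ranks(oil + academics)   # rank both groups together
t_a = sum(combined[:len(oil)])              # rank sum of the smaller group
t_b = sum(combined[len(oil):])

# Cutoffs quoted in the text for n1 = 6, n2 = 7 at the 0.05 level
reject_two_tailed = not (28 <= t_a <= 56)
reject_one_tailed = t_a < 30
print(t_a, t_b, reject_two_tailed, reject_one_tailed)
```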

(b) The Wilcoxon signed rank test is meant for paired 
tests where samples taken are not independent. This 
is analogous to the two sample paired difference test 
treated in Sect. 4.2.3b. As before, one deals with one 
variable involving paired differences of observations or 
data. This is illustrated by the following example. 

Example 4.5.3: Evaluating predictive accuracy of two climate change models from expert elicitation

A policy maker wishes to evaluate the predictive accuracy of two different climate change models for predicting short-term (say, 30 years) carbon dioxide changes in the atmosphere. He consults 10 experts and asks them to grade these models on a scale from 1 to 10, with 10 being extremely accurate. Clearly, this data is not independent since the same expert is asked to make two value judgments about the models being evaluated. The data obtained are shown in Table 4.10 (note that these are not ranked values, except for the last column, but the grades from 1 to 10 assigned by the experts).

Table 4.10 Wilcoxon signed rank test calculation for paired non-independent samples (Example 4.5.3)

Expert   Model A   Model B   Difference   Absolute     Rank
                             (A-B)        difference
1        6         4          2           2             5
2        8         5          3           3             7.5
3        4         5         -1           1             2
4        9         8          1           1             2
5        4         1          3           3             7.5
6        7         9         -2           2             5
7        6         2          4           4             9
8        5         3          2           2             5
9        6         7         -1           1             2
10       8         2          6           6            10
Sum of positive ranks T+ = 46
Sum of negative ranks T- = 9

The paired differences are first computed (as shown in the table), from which the ranks are generated based on the absolute differences, and finally the sums of the positive and negative ranks are computed. Note how the ranking has been assigned since there are repeats in the absolute difference values. There are three "1" values in the absolute difference column, which would occupy ranks 1, 2 and 3. Hence a mean rank of "2" has been assigned to all three. Similarly, for the three absolute differences of "2", the rank is given as "5", and so on. For the highest absolute difference of "6", the rank is assigned as "10". The values shown in the last two rows of the table are also simple to deduce. The values in the difference (A-B) column are either positive or negative. One simply adds up all the rank values corresponding to the cases when (A-B) is positive and also when it is negative. These are found to be 46 and 9 respectively.

The test statistic for the null hypothesis is $T = \min(T_-, T_+)$. In our case, T = 9. The smaller the value of T, the stronger the evidence that the two distributions differ. The rejection region for T is determined from Table A12. The two-tailed critical value for n = 10 at 0.05 significance level is 8. Since the computed value for T is higher, one cannot reject the null hypothesis, and so one would conclude that there is not enough evidence to suggest that one of the models is more accurate than the other at the 0.05 significance level. Note that if a significance level of 0.10 were selected, the null hypothesis would have been rejected.

Looking at the ratings shown in the table, one notices that these seem to be generally higher for model A than for model B. In case one wishes to test the hypothesis, at a significance level of 0.05, that researchers deem model B to be less accurate than model A, one would have used $T_-$ as the test statistic and compared it to the critical value in the one-tailed column of Table A12. Since the critical value is 11 for n = 10, which is greater than 9, one would reject the null hypothesis. ■
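The signed rank sums of Example 4.5.3 can be checked with the sketch below. It uses the grades of Table 4.10 and the critical values quoted in the text (8 for the two-tailed test and 11 for the one-tailed test at n = 10); the helper name `average_ranks` is illustrative, and zero differences (none occur here) are simply dropped, which is a common convention assumed for this sketch.

```python
def average_ranks(values):
    """Rank from 1 (smallest) upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


# Grades from Table 4.10
model_a = [6, 8, 4, 9, 4, 7, 6, 5, 6, 8]
model_b = [4, 5, 5, 8, 1, 9, 2, 3, 7, 2]

diffs = [a - b for a, b in zip(model_a, model_b) if a != b]  # drop zero differences
ranks = average_ranks([abs(d) for d in diffs])               # rank the absolute differences

t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
t_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
t_stat = min(t_plus, t_minus)

reject_two_tailed = t_stat <= 8    # critical value quoted from Table A12
reject_one_tailed = t_minus <= 11  # one-tailed critical value quoted in the text
print(t_plus, t_minus, reject_two_tailed, reject_one_tailed)
```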



4.5.3 Kruskal-Wallis: Multiple Samples Test

Recall that the single-factor ANOVA test was described in Sect. 4.3.1 for inferring whether mean values from several samples emanate from the same population or not, with the necessary assumption of normal distributions. The Kruskal-Wallis H test is the nonparametric equivalent. It can also be taken to be the extension or generalization of the rank-sum test to more than two groups. Hence, the test applies to the case when one wishes to compare more than two groups which may not be normally distributed. Again, the evaluation is based on the rank sums, where the ranking is made based on samples of all k groups combined. The test is framed as follows:

$H_0$: All populations have identical probability distributions
$H_a$: Probability distributions of at least two populations are different

Let $R_1$, $R_2$, $R_3$ denote the rank sums of, say, three samples. The H-test statistic measures the extent to which the three samples differ with respect to their relative ranks, and is given by:

$$ H = \frac{12}{n(n+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(n+1) \tag{4.36} $$

where k is the number of groups, $n_j$ is the number of observations in the jth sample and n is the total sample size. Thus, if the H statistic is close to zero, one would conclude that all groups have the same mean rank, and vice versa. The distribution of the H statistic is approximated by the chi-square distribution, which is used to make statistical inferences. The following example illustrates the approach.



Example 4.5.4: Evaluating probability distributions of number of employees in three different occupations using a non-parametric test
(From McClave and Benson (1988) by permission of Pearson Education.)

One wishes to compare, at a significance level of 0.05, the number of employees in companies representing each of three different business classifications, namely agriculture, manufacturing and service. Samples from ten companies each were gathered, which are shown in Table 4.11. Since the distributions are unlikely to be normal (there are some large numbers), a nonparametric test is appropriate.

Table 4.11 Data table for Example 4.5.4

       Agriculture           Manufacturing         Service
       # employees   Rank    # employees   Rank    # employees   Rank
1        10           5        244         25         17          9.5
2       350          27         93         19        249         26
3         4           2       3532         30         38         15
4        26          13         17          9.5        5          3
5        15           8        526         29        101         20
6       106          21        133         22          1          1
7        18          11         14          7         12          6
8        23          12        192         23        233         24
9        62          17        443         28         31         14
10        8           4         69         18         39         16
       R1 =         120      R2 =         210.5    R3 =         134.5

First, the ranks for all samples from the three classes are generated as shown tabulated under the three Rank columns. The values of the sums $R_j$ are also computed and shown in the last row. Note that n = 30, while $n_j = 10$. The test statistic H is computed first:

$$ H = \frac{12}{30(31)} \left[ \frac{120^2}{10} + \frac{210.5^2}{10} + \frac{134.5^2}{10} \right] - 3(31) = 99.097 - 93 = 6.097 $$

The degrees of freedom is the number of groups minus one, or 3 - 1 = 2. From the Chi-square tables (Table A5), the critical value at $\alpha = 0.05$ is 5.991. Since the computed H value exceeds this threshold, one would reject the null hypothesis at 95% CL and conclude that at least two of the three probability distributions describing the number of employees in the sectors are different. However, the verdict is marginal since the computed H statistic is close to the critical value. It would be wise to consider the practical implications of the statistical inference test, and perform a decision analysis study. ■
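The Kruskal-Wallis calculation of Example 4.5.4 can be reproduced with the following minimal sketch, which applies Eq. 4.36 directly to the employee counts of Table 4.11. No correction for ties is applied (the example does not use one), the critical value 5.991 is the one quoted from Table A5, and the function names are illustrative.

```python
def average_ranks(values):
    """Rank from 1 (smallest) upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic of Eq. 4.36 (no tie correction) and the rank sum of each group."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)           # ranks over all groups combined
    n = len(pooled)
    rank_sums, start = [], 0
    for g in groups:
        rank_sums.append(sum(ranks[start:start + len(g)]))
        start += len(g)
    h = 12.0 / (n * (n + 1)) * sum(r ** 2 / len(g) for r, g in zip(rank_sums, groups)) - 3 * (n + 1)
    return h, rank_sums


# Number of employees from Table 4.11
agriculture = [10, 350, 4, 26, 15, 106, 18, 23, 62, 8]
manufacturing = [244, 93, 3532, 17, 526, 133, 14, 192, 443, 69]
service = [17, 249, 38, 5, 101, 1, 12, 233, 31, 39]

h_stat, rank_sums = kruskal_wallis_h([agriculture, manufacturing, service])
reject = h_stat > 5.991   # chi-square critical value quoted for 2 d.f. at alpha = 0.05
print(rank_sums, round(h_stat, 3), reject)
```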



4.6.1 Background 

Bayes' theorem and how it can be used for probability related problems has been treated in Sect. 2.5. Its strength lies in the fact that it provides a framework for including prior information in a two-stage (or multi-stage) experiment whereby one could draw stronger conclusions than one could with observational data alone. It is especially advantageous for small data sets, and it was shown that its predictions converge with those of the classical method for two cases: (i) as the data set of observations gets larger; and (ii) if the prior distribution is modeled as a uniform distribution. It was pointed out that advocates of the Bayesian approach view probability as a degree of belief held by a person about an uncertainty issue, as compared to the objective view of long run relative frequency held by traditionalists. This section will discuss how the Bayesian approach can also be used to make statistical inferences from samples about an uncertain quantity, and also used for hypothesis testing problems.



4.6.2 Inference About One Uncertain Quantity 

Consider the case when the population mean $\mu$ is to be estimated (point and interval estimates) from the sample mean $\bar{x}$ with the population assumed to be Gaussian with a known standard deviation $\sigma$. This case is given by the sampling distribution of the mean $\bar{x}$ treated in Sect. 4.2.1. The probability P of a two-tailed distribution at significance level $\alpha$ can be expressed as:

$$ P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha \tag{4.37} $$

where n is the sample size and z is the value from the standard normal tables. The traditional interpretation is that one can be $(1 - \alpha)$ confident that the above interval contains the true population mean. However, the interval itself should not be interpreted as a probability interval for the parameter.

The Bayesian approach uses the same formula, but the mean and standard deviation are modified since the posterior distribution is now used, which includes the sample data as well as the prior belief. The confidence interval is usually narrower than the traditional one and is referred to as the credible interval or the Bayesian confidence interval. The interpretation of this credible interval is somewhat different from the traditional confidence interval: there is a $(1 - \alpha)$ probability that the population mean falls within the interval. Thus, the traditional approach leads to a probability statement about the interval, while the Bayesian approach leads to one about the population parameter (Phillips 1973).

The relevant procedure to calculate the credible intervals for the case of a Gaussian population and a Gaussian prior is presented without proof below (Wonnacott and Wonnacott 1985). Let the prior distribution, assumed normal, be characterized by a mean $\mu_0$ and variance $\sigma_0^2$, while the sample values are $\bar{x}$ and $s_x^2$. Selecting a prior distribution is equivalent to having a quasi-sample of size $n_0$ given by:

$$ n_0 = \frac{s_x^2}{\sigma_0^2} \tag{4.38} $$

The posterior mean and standard deviation $\mu^*$ and $\sigma^*$ are then given by:

$$ \mu^* = \frac{n_0 \mu_0 + n \bar{x}}{n_0 + n} \quad \text{and} \quad \sigma^* = \frac{s_x}{(n_0 + n)^{1/2}} \tag{4.39} $$

Note that the expression for the posterior mean is simply the weighted average of the sample and the prior mean, and is likely to be less biased than the sample mean alone. Similarly, the standard deviation is divided by the square root of the total (quasi plus actual) sample size and will result in increased precision. However, had a different prior rather than the normal distribution been assumed above, a slightly different interval would have resulted, which is another reason why traditional statisticians are uneasy about fully endorsing the Bayesian approach.

Example 4.6.1: Comparison of classical and Bayesian confidence intervals

A certain solar PV module is rated at 60 W with a standard deviation of 2 W. Since the rating varies somewhat from one shipment to the next, a sample of 12 modules has been selected from a shipment and tested to yield a mean of 65 W and a standard deviation of 2.8 W. Assuming a Gaussian distribution, determine the 95% confidence intervals by both the traditional and the Bayesian approaches.

(a) Traditional approach:

$$ \mu = \bar{x} \pm 1.96 \frac{s_x}{n^{1/2}} = 65 \pm 1.96 \frac{2.8}{12^{1/2}} = 65 \pm 1.58 $$

(b) Bayesian approach. Using Eq. 4.38 to calculate the quasi-sample size inherent in the prior:

$$ n_0 = \frac{2.8^2}{2^2} = 1.96 \approx 2.0 $$

i.e., the prior is equivalent to information from an additional 2 modules tested.

Next, Eq. 4.39 is used to determine the posterior mean and standard deviation:

$$ \mu^* = \frac{2(60) + 12(65)}{2 + 12} = 64.29 \quad \text{and} \quad \sigma^* = \frac{2.8}{(2 + 12)^{1/2}} = 0.748 $$

The Bayesian 95% confidence interval is then:

$$ \mu = \mu^* \pm 1.96\,\sigma^* = 64.29 \pm 1.96(0.748) = 64.29 \pm 1.47 $$

Since prior information has been used, the Bayesian interval is likely to be centered better and be more precise (with a narrower interval) than the classical interval.
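Equations 4.38 and 4.39 translate directly into code. The sketch below reproduces the numbers of Example 4.6.1; the function name `gaussian_posterior` and the `round_n0` flag are illustrative assumptions (the flag mimics the rounding of the quasi-sample size from 1.96 to 2 done in the example).

```python
import math


def gaussian_posterior(prior_mean, prior_sd, sample_mean, sample_sd, n, round_n0=True):
    """Posterior mean and standard deviation for a Gaussian prior and Gaussian data."""
    n0 = sample_sd ** 2 / prior_sd ** 2          # Eq. 4.38: quasi-sample size of the prior
    if round_n0:
        n0 = round(n0)                           # the worked example rounds 1.96 to 2
    post_mean = (n0 * prior_mean + n * sample_mean) / (n0 + n)   # Eq. 4.39
    post_sd = sample_sd / math.sqrt(n0 + n)                      # Eq. 4.39
    return n0, post_mean, post_sd


# Example 4.6.1: PV modules rated at 60 W (sd 2 W); 12 tested, mean 65 W, sd 2.8 W
n0, post_mean, post_sd = gaussian_posterior(60.0, 2.0, 65.0, 2.8, 12)

z = 1.96                                         # two-tailed 95% standard normal value
classical_half = z * 2.8 / math.sqrt(12)
bayes_half = z * post_sd
print(f"classical: 65 +/- {classical_half:.2f}")
print(f"Bayesian:  {post_mean:.2f} +/- {bayes_half:.2f}")
```

Setting `round_n0=False` keeps the unrounded value 1.96, which changes the posterior mean only in the second decimal place.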



4.6.3 Hypothesis Testing 

Section 4.2 dealt with the traditional approach to hypothesis testing where one frames the problem in terms of two competing claims. The application areas discussed involved testing for single sample mean, testing for two sample and paired differences, testing for single and two sample variances, testing for distributions, and testing on the Pearson correlation coefficient. In all these cases, one proceeds by defining two hypotheses:

(a) The null hypothesis ($H_0$), which represents the status quo, i.e., the hypothesis that will be accepted unless the data provides convincing evidence to the contrary.

(b) The research or alternative hypothesis ($H_1$), which is the premise that the variation observed in the data sample cannot be ascribed to random variability or chance alone, and that there must be some inherent structural or fundamental cause.

Thus, the traditional or frequentist approach is to divide the sample space into an acceptance region and a rejection region, and posit that the null hypothesis can be rejected only if the probability of the test statistic lying in the rejection region can be ascribed to chance or randomness at the preselected significance level $\alpha$. Advocates of the Bayesian approach have several objections to this line of thinking (Phillips 1973):

(i) the null hypothesis is rarely of much interest. The precise specification of, say, the population mean is of limited value; rather, ascertaining a range would be more useful;
(ii) the null hypothesis is only one of many possible values of the uncertain variable, and undue importance being placed on this value is unjustified;
(iii) as additional data is collected, the inherent randomness in the collection process would lead to the null hypothesis being rejected in most cases;
(iv) erroneous inferences from a sample may result if prior knowledge is not considered.

The Bayesian approach to hypothesis testing is not to base the conclusions on a traditional significance level like p < 0.05. Instead it makes use of the posterior credible interval introduced in the previous section. The procedure is summarized below for the instance when one wishes to test the population mean $\mu$ of the sample collected against a prior mean value $\mu_0$ (Bolstad 2004).

(a) One sided hypothesis test: Let the posterior distribution of the mean value be given by $g(\mu \mid x_1, \ldots, x_n)$. The hypothesis test is set up as:

$$ H_0: \mu \le \mu_0 \quad \text{versus} \quad H_1: \mu > \mu_0 \tag{4.40} $$

Let $\alpha$ be the significance level assumed (usually 0.10, 0.05 or 0.01). Then, the posterior probability of the null hypothesis, for the special case when the posterior distribution is Gaussian, is:

$$ P(H_0: \mu \le \mu_0 \mid x_1, \ldots, x_n) = P\left(z \le \frac{\mu_0 - \mu^*}{\sigma^*}\right) \tag{4.41} $$

where z is the standard normal variable with $\mu^*$ and $\sigma^*$ given by Eq. 4.39. If the probability is less than our selected value of $\alpha$, the null hypothesis is rejected, and one concludes that $\mu > \mu_0$.

(b) Two sided hypothesis test: In this case, one is testing for

$$ H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu \ne \mu_0 \tag{4.42} $$

A slightly different approach is warranted since one is dealing with continuous variables for which the probability of them assuming a specific value is nil. Here, one calculates the $(1 - \alpha)$ credible interval for $\mu$ using our posterior distribution. If $\mu_0$ is outside this interval, the null hypothesis is rejected; and vice versa.

Example 4.6.2: Traditional and Bayesian approaches to determining confidence levels

The life of a certain type of smoke detector battery is specified as having a mean of 32 months and a standard deviation of 0.5 months. The variable can be assumed to have a Gaussian distribution. A building owner decides to test this claim at a significance level of 0.05. He tests a sample of 9 batteries and finds a mean of 31 months and a sample standard deviation of 1 month. Note that this is a one-sided hypothesis test case in which the concern is that the battery life is lower than claimed.

(a) The traditional approach would entail testing $H_0: \mu \ge 32$ versus $H_1: \mu < 32$. The Student t value is:

$$ t = \frac{31 - 32}{1/\sqrt{9}} = -3.0 $$

From Table A4, the critical value for d.f. = 8 is $t_{0.05} = -1.86$. Since the computed value is lower than this, he can reject the null hypothesis, and state that the claim of the manufacturer is incorrect.

(b) The Bayesian approach, on the other hand, would require calculating the posterior probability of the null hypothesis. The prior distribution has a mean $\mu_0 = 32$ and variance $\sigma_0^2 = 0.5^2$.

First, use Eq. 4.38, and determine $n_0 = 1^2/0.5^2 = 4$, i.e., the prior information is "equivalent" to increasing the sample size by 4. Next, use Eq. 4.39 to determine the posterior mean and standard deviation:

$$ \mu^* = \frac{4(32) + 9(31)}{4 + 9} = 31.3 \quad \text{and} \quad \sigma^* = \frac{1.0}{(4 + 9)^{1/2}} = 0.277 $$

From here:

$$ t = \frac{32.0 - 31.3}{0.277} = 2.53 $$

From the Student t table (Table A4) for d.f. = (9 + 4 - 1) = 12, this corresponds to a confidence level of slightly less than 99%, i.e., a probability between 0.01 and 0.025. Since this is lower than the selected significance level $\alpha = 0.05$, he can reject the null hypothesis. In this case, both approaches gave the same result, but sometimes one would reach different conclusions, especially when sample sizes are small. ■
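The Bayesian part of Example 4.6.2 can be sketched as below. Two assumptions should be noted: the standard normal distribution is used for the posterior probability (as in Eq. 4.41) instead of the Student t table consulted in the example, and the posterior mean is not rounded, so the standardized value comes out near 2.50 instead of the 2.53 obtained with the rounded value 31.3. The helper name `normal_cdf` is illustrative.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal variable."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Example 4.6.2: prior N(32, 0.5^2); sample of 9 with mean 31 and sd 1
mu0, sigma0 = 32.0, 0.5
x_bar, s_x, n = 31.0, 1.0, 9

n0 = s_x ** 2 / sigma0 ** 2                        # Eq. 4.38 -> 4
post_mean = (n0 * mu0 + n * x_bar) / (n0 + n)      # Eq. 4.39 -> about 31.31
post_sd = s_x / math.sqrt(n0 + n)                  # Eq. 4.39 -> about 0.277

z = (mu0 - post_mean) / post_sd                    # standardized distance of 32 from the posterior mean
p_mu_le_32 = normal_cdf(z)                         # P(mu <= 32 | data), as in Eq. 4.41
p_mu_ge_32 = 1.0 - p_mu_le_32                      # posterior probability that the life is at least 32 months

alpha = 0.05
claim_rejected = p_mu_ge_32 < alpha
print(f"z = {z:.2f}, P(mu >= 32 | data) = {p_mu_ge_32:.4f}, rejected = {claim_rejected}")
```

The conclusion matches that of the example: the posterior probability that the mean life reaches the claimed 32 months is well below 0.05.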



4.7 Sampling Methods 

4.7.1 Types of Sampling Procedures 

A sample is a portion or limited number of items from a larger entity called the population, about which information and characteristic traits are sought. Point and interval estimation as well as notions of inferential statistics covered in the previous sections involved the use of samples drawn from some underlying population. The premise was that finite samples would reduce the expense associated with the estimation, this being viewed as more critical than the associated uncertainty which would consequently creep into the estimation process. It is quite clear that the sample drawn must be representative of the population. However, there are different ways by which one could draw samples; this aspect falls under the purview of sampling design. Since these have different implications, they are discussed in this section.

There are three general rules of sampling design:
(i) the more representative the sample, the better the results;
(ii) all else being equal, larger samples yield better results, i.e., the results are more precise;
(iii) larger samples cannot compensate for a poor sampling design plan or a poorly executed plan.

Some of the common sampling methods are described below:

(a) random sampling (also called simple random sam- 
pling) is the simplest conceptually, and is most widely 
used. It involves selecting the sample of n elements in 
such as way that all possible samples of n elements have 
the same chance of being selected. Two important stra- 
tegies of random sampling involve: 
(i) sampling with replacement, in which the object se- 
lected is put back into the population pool and has 
the possibility to be selected again in subsequent 
picks, and 
(ii) sampling without replacement, where the object pi- 
cked is not put back into the population pool prior 
to picking the next item. 
Random sampling without replacement of n objects from a
population of N objects could be practically implemented
in one of several ways. The most common is to order the
objects of the population (say as 1, 2, 3 . . . N), use a
random number generator to generate n numbers from 1 to N
without replication, and pick only the objects whose
numbers have been generated. This approach is illustrated
by means of the following example. A consumer group wishes
to select a sample of 5 cars from a lot of 500 cars for
crash testing. It assigns integers from 1 to 500 to each
and every car on the lot, uses a random number generator
to select a set of 5 distinct integers, and thereby selects
the 5 cars corresponding to the 5 integers picked randomly.

Dealing with random samples has several advantages:
(i) any random sub-sample of a random sample or its
complement is also a random sample; (ii) after a random
sample has been selected, any random sample from its
complement can be added to it to form a larger random
sample.
(b) non-random sampling occurs when the selection of 
members from the population is done according to 
some method or pre-set process which is not random. 
Often it occurs unintentionally or unwittingly with the 
experimenter thinking that he is dealing with random 
samples while he is not. In such cases, bias or skewness 
is introduced, and one obtains misleading confidence 
limits which may lead to erroneous inferences depen- 
ding on the degree of non-randomness in the data set. 
However, in some cases, the experimenter intentionally 
selects the samples in a non-random manner and analy- 
zes the data accordingly. This can result in the required 
conclusions being reached with reduced sample sizes, 
thereby saving resources. There are different types of 
nonrandom sampling (ASTM E 1402 1996), and some 
of the important ones are listed below: 
(i) stratified sampling in which the target population is 
such that it is amenable to partitioning into disjoint 
subsets or strata based on some criterion. Samples 
are selected independently from each stratum, pos- 
sibly of different sizes. This improves efficiency 
of the sampling process in some instances, and is 
discussed at more length in Sect. 4.7.4; 
(ii) cluster sampling in which strata are first generated 
(these are synonymous to clusters), then random 
sampling is done to identify a subset of clusters, 
and finally all the elements in the picked clusters 
are selected for analysis; 
(iii) sequential sampling is a quality control procedure 
where a decision on the acceptability of a batch of 
products is made from tests done on a sample of 
the batch. Tests are done on a preliminary sample, 
and depending on the results, either the batch is 
accepted or further sampling tests are performed. 
This procedure usually requires, on average, fewer
samples to be tested to meet a pre-stipulated
accuracy;
(iv) composite sampling where elements from different
samples are combined together;

(v) multistage or nested sampling which involves se- 
lecting a sample in stages. A larger sample is first 
selected, and then subsequently smaller ones. For 
example, for testing indoor air quality in a popu- 
lation of office buildings, the design could involve 
selecting individual buildings during the first stage 
of sampling, choosing specific floors of the selec- 
ted buildings in the second stage of sampling, and 
finally, selecting specific rooms in the floors cho- 
sen to be tested during the third and final stage. 

(vi) convenience sampling, also called opportunity 
sampling, is a method of choosing samples arbitra- 
rily following the manner in which they are acqui- 
red. If the situation is such that a planned experi- 
mental design cannot be followed, the analyst has 
to make do with the samples in the sequence they 
are acquired. Though impossible to treat rigorous- 
ly, it is commonly encountered in many practical 
situations. 
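
The car-lot illustration of simple random sampling given under (a) can be sketched in a few lines of Python. This is only a minimal sketch: the fixed seed and the variable names are assumptions introduced here for repeatability and are not part of the original example.

```python
import random

rng = random.Random(42)  # fixed seed (an assumption) so the draw is repeatable

# Cars on the lot are labeled 1..500, as in the example under (a).
lot = list(range(1, 501))

# Simple random sampling WITHOUT replacement: 5 distinct car numbers.
sample_without = rng.sample(lot, k=5)

# Simple random sampling WITH replacement: a car may be drawn more than once.
sample_with = rng.choices(lot, k=5)

print("without replacement:", sorted(sample_without))
print("with replacement:   ", sorted(sample_with))
```

`random.sample` guarantees distinct picks, which corresponds to generating the integers "without replication"; `random.choices` returns each pick to the pool before the next draw.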



4.7.2 Desirable Properties of Estimators 

The parameters from a sample are random variables since 
different sets of samples will result in different values for 
the parameters. Recall the definition of two seemingly ana- 
logous, but distinct, terms: an estimate is a specific number, 
while an estimator is a random variable. Since the search for 
estimators is the crux of the parameter estimation process, 
certain basic notions and desirable properties of estimators 
need to be explicitly recognized (a good discussion is provi- 
ded by Pindyck and Rubinfeld 1981). Many of these concepts 
are logical extensions of the concepts applicable to errors, 
and also apply to regression models treated in Chap. 5. For
example, consider the case where inferences about the
population mean parameter $\mu$ are to be made from the sample
mean estimator $\bar{x}$.

(a) Lack of bias: A very desirable property is for the
distribution of the estimator to have the parameter as its
mean value (see Fig. 4.14). Then, if the experiment were
repeated many times, one would at least be assured that
one would be right on average. In such a case, the bias
$E(\bar{x} - \mu) = 0$, where E represents the expected value.

Fig. 4.15 Concept of efficiency of estimators


(b) Efficiency: Lack of bias provides no indication
regarding the variability. Efficiency is a measure of how
small the dispersion can possibly get. The estimator $\bar{x}$ is
said to be an efficient unbiased estimator if, for a given
sample size, the variance of $\bar{x}$ is smaller than the
variance of any other unbiased estimator (see Fig. 4.15)
and is the smallest limiting variance that can be
achieved. More often a relative order of merit, called the
relative efficiency, is used, which is defined as the
ratio of the two variances. Efficiency is desirable since
the greater the efficiency associated with an estimation
process, the stronger the statistical or inferential
statements one can make about the estimated parameters.

Consider the following example (Wonnacutt and Wonnacutt
1985). If a population being sampled is symmetric, its
center can be estimated without bias by either the sample
mean $\bar{x}$ or its median $\tilde{x}$. For some populations $\bar{x}$ is
more efficient; for others $\tilde{x}$ is more efficient. In case
of a normal parent distribution, the standard error of the
median is $SE(\tilde{x}) = 1.25\,\sigma/\sqrt{n}$. Since
$SE(\bar{x}) = \sigma/\sqrt{n}$, the efficiency of $\bar{x}$ relative to $\tilde{x}$ is

$$\frac{\mathrm{var}(\tilde{x})}{\mathrm{var}(\bar{x})} = 1.25^2 = 1.56 \qquad (4.43)$$

Fig. 4.14 Concept of biased and unbiased estimators


(c) Mean square error: There are many circumstances in
which one is forced to trade off bias and variance of
estimators. When the goal of a model is to maximize the
precision of predictions, for example, an estimator with
very low variance and some bias may be more desirable
than an unbiased estimator with high variance (see
Fig. 4.16). One criterion which is useful in this regard is
the goal of minimizing mean square error (MSE), defined as:

$$\mathrm{MSE}(\bar{x}) = E(\bar{x} - \mu)^2 = [\mathrm{Bias}(\bar{x})]^2 + \mathrm{var}(\bar{x}) \qquad (4.44)$$

where $\mathrm{Bias}(\bar{x}) = E(\bar{x}) - \mu$ and $E(\bar{x})$ is the expected
value of $\bar{x}$.
Thus, when $\bar{x}$ is unbiased, the mean square error and
variance of the estimator $\bar{x}$ are equal. MSE may be
regarded as a generalization of the variance concept. This
leads to the generalized definition of the relative
efficiency of two estimators, whether biased or unbiased:
"efficiency is the ratio of both MSE values."

Fig. 4.16 Concept of mean square error which includes bias and
efficiency of estimators

(d) Consistency: Consider the properties of estimators as
the sample size increases. In such cases, one would like
the estimator $\bar{x}$ to converge to the true value, or the
probability limit of $\bar{x}$ (plim $\bar{x}$) should equal $\mu$ as
sample size n approaches infinity (see Fig. 4.17). This
leads to the criterion of consistency: $\bar{x}$ is a consistent
estimator of $\mu$ if plim($\bar{x}$) = $\mu$. In other words, as the
sample size grows larger, a consistent estimator would
tend to approximate the true parameter, i.e., the mean
square error of the estimator approaches zero. Thus, one of
the conditions that make an estimator consistent is that
both its bias and variance approach zero in the limit.
However, it does not necessarily follow that an unbiased
estimator is a consistent estimator. Although consistency
is an abstract concept, it often provides a useful
preliminary criterion for sorting out estimators. However,
to finally determine the best estimator, the efficiency
is a more powerful criterion. As discussed earlier, the
sample mean is preferable to the median for estimating
the center of a normal population because the former is
more efficient though both estimators are clearly
consistent and unbiased.

Fig. 4.17 A consistent estimator is one whose distribution becomes
gradually peaked as the sample size n is increased

As a general rule, one tends to be more concerned with 
consistency than with lack of bias as an estimation criterion. 
A biased yet consistent estimator may not equal the true pa- 
rameter on average, but will approximate the true parameter 
as the sample information grows larger. This is more reas- 
suring practically than the alternative of finding a parame- 
ter estimate which is unbiased initially, yet will continue to 
deviate substantially from the true parameter as the sample 
size gets larger. 
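
The efficiency comparison behind Eq. 4.43 can be checked numerically. The sketch below repeatedly draws samples from a normal parent distribution and compares the spread of the sample mean with that of the sample median. The sample size, number of repetitions and seed are hypothetical choices made here for illustration, so the variance ratio it prints should be read as an approximation to the value of about 1.56 quoted in the text.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed (an assumption) for repeatability
n = 101                  # sample size (hypothetical choice)
trials = 2000            # number of repeated samples (hypothetical choice)

means, medians = [], []
for _ in range(trials):
    # Normal parent distribution with true mean 0 and standard deviation 1.
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

# Lack of bias: both estimators should average close to the true mean (0).
avg_of_means = statistics.fmean(means)
avg_of_medians = statistics.fmean(medians)

# Efficiency: compare the variances of the two estimators.
var_mean = statistics.pvariance(means)
var_median = statistics.pvariance(medians)
rel_eff = var_median / var_mean   # efficiency of the mean relative to the median

print(f"average of sample means   = {avg_of_means:+.4f}")
print(f"average of sample medians = {avg_of_medians:+.4f}")
print(f"var(median) / var(mean)   = {rel_eff:.2f}")
```

Both estimators come out essentially unbiased, while the median shows the larger variance. This is the sense in which the sample mean is the more efficient estimator for a normal population.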



4.7.3 Determining Sample Size During 
Random Surveys 

Population census, market surveys, pharmaceutical field
trials, etc. are examples of survey sampling. These can be
done in one of two ways which are discussed in this section 
and in the next. The discussion and equations presented 
in the previous sub-sections pertain to random sampling. 
Survey sampling frames the problem using certain terms 
slightly different from those presented above. Here, a ma- 
jor issue is to determine the sample size which can meet 
a certain pre-stipulated precision at predefined confidence 
levels. 

The estimates from the sample should be close enough to 
the population characteristic so as to be useful for drawing 
conclusions and taking subsequent decisions. One generally 
assumes the underlying probability distribution to be normal. 
Let RE be the relative error (also called the margin of error
or bound on error of estimation) of the population mean $\mu$ at
a confidence level $(1 - \alpha)$, which, for a two-tailed
distribution, is defined as:

$$RE_{1-\alpha} = z_{\alpha/2} \cdot \frac{\sigma_{\bar{x}}}{\mu} \qquad (4.45)$$

where $\sigma_{\bar{x}}$ is the standard error given by Eq. 4.2.

A measure of variability in the population needs to be
introduced, and this is done through the coefficient of
variation (CV) defined as:

$$CV = \frac{\text{std. dev.}}{\text{true mean}} = \frac{s}{\mu}$$

where s is the sample standard deviation. The maximum
value of s which would allow the confidence level to be met is:

$$s = z_{\alpha/2} \cdot CV_{1-\alpha} \cdot \bar{x} \qquad (4.46)$$



One could deduce the required sample size n from the above
equation to reach the target $RE_{1-\alpha}$ as follows. First, a
simplifying assumption is made by replacing (N - 1) by N in
Eq. 4.3, which is the expression for the standard error of the
mean for small samples. Then

$$\sigma_{\bar{x}}^2 = \frac{s^2}{n} \cdot \frac{N - n}{N} = \frac{s^2}{n} - \frac{s^2}{N} \qquad (4.47)$$




Finally, using the definitions of RE and CV stated above, the
required sample size is:

$$n = \frac{1}{\left( \dfrac{RE_{1-\alpha}}{z_{\alpha/2} \cdot CV_{1-\alpha}} \right)^2 + \dfrac{1}{N}} \qquad (4.48)$$

This is the functional form normally used in survey sampling
in order to determine sample size provided some prior
estimates of the population mean and standard deviation are
known.
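
The step from the definitions to Eq. 4.48 can be sketched as follows. The sketch assumes that Eq. 4.45 has the form $RE_{1-\alpha} = z_{\alpha/2}\,\sigma_{\bar{x}}/\mu$, that $CV = s/\mu$, and that the squared standard error is $s^2/n - s^2/N$ as in Eq. 4.47. The symbol $\sigma_{\bar{x}}$ for the standard error is a notational assumption made here.

```latex
\begin{align*}
RE_{1-\alpha}^2
  &= z_{\alpha/2}^2 \, \frac{\sigma_{\bar{x}}^2}{\mu^2}
   = z_{\alpha/2}^2 \, \frac{s^2}{\mu^2} \left( \frac{1}{n} - \frac{1}{N} \right)
   = z_{\alpha/2}^2 \, CV^2 \left( \frac{1}{n} - \frac{1}{N} \right) \\[4pt]
\frac{1}{n}
  &= \left( \frac{RE_{1-\alpha}}{z_{\alpha/2} \, CV} \right)^2 + \frac{1}{N}
\end{align*}
```

Taking the reciprocal of the last line gives the functional form of Eq. 4.48. When N is very large the term 1/N drops out and n is approximately $(z_{\alpha/2}\,CV/RE_{1-\alpha})^2$, so that halving the acceptable relative error roughly quadruples the required sample size.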

Example 4.7.1: Determination of random sample size nee- 
ded to verify peak reduction in residences at preset confiden- 
ce levels 

An electric utility has provided financial incentives to a 
large number of their customers to replace their existing air- 
conditioners with high efficiency ones. This rebate program 
was initiated in an effort to reduce the aggregated electric 
peak during hot summer afternoons which is dangerously 
close to the peak generation capacity of the utility. The utili- 
ty analyst would like to determine the sample size necessary 
to assess whether the program has reduced the peak as pro- 



jected such that the relative error RE< 10% at 90% CL. The 
following information is given: 

The total number of customers: N = 20,000

Estimate of the mean peak saving $\mu$ = 2 kW (from
engineering calculations)

Estimate of the standard deviation s = 1 kW (from
engineering calculations)

This is a two-tailed distribution problem with 90% CL,
which corresponds to a one-tailed significance level of
$\alpha/2$ = (100 - 90)/2/100 = 0.05. Then, from Table A4,
$z_{0.05}$ = 1.65. Inserting values of RE = 0.1 and
CV = s/$\mu$ = 1/2 = 0.5 in Eq. 4.48, the required sample size is:

$$n = \frac{1}{\left( \dfrac{0.1}{(1.65)(0.5)} \right)^2 + \dfrac{1}{20,000}} = \frac{1}{0.01469 + 0.00005} = 67.8 \approx 68$$

It would be advisable to perform some sensitivity runs given
that many of the assumed quantities are guess-estimates. It is
simple to use the above approach to generate figures such as
Fig. 4.18 for assessing tradeoff between increased accuracy
with sample size and increase in cost of instrumentation,
installation and monitoring as sample size is increased.

Note that accepting additional error reduces sample size
rapidly, roughly as $1/RE^2$ when n is much smaller than N.
For example, relaxing the requirement that RE < ±10% to
< ±15% decreases n from 68 to about 30, while tightening it
to < ±5% would require a sample size of about 270 (about
four times the initial estimate).
On the other hand, there is not much one could do about
varying CV since this represents an inherent variability in the
population if random sampling is adopted. However,
non-random stratified sampling, described next, could be one
approach to reduce sample sizes. ■
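
The sensitivity runs suggested in Example 4.7.1 are easy to automate. The sketch below evaluates the sample size expression in the form $n = 1/[(RE/(z \cdot CV))^2 + 1/N]$, which is how Eq. 4.48 is read here. The function name `required_sample_size` and the grid of RE and CV values are assumptions introduced for illustration.

```python
def required_sample_size(rel_error, z, cv, population):
    """Sample size n = 1 / [ (RE / (z * CV))**2 + 1/N ] (form assumed for Eq. 4.48)."""
    return 1.0 / ((rel_error / (z * cv)) ** 2 + 1.0 / population)


N = 20_000    # total number of customers (Example 4.7.1)
z_90 = 1.65   # z-value for a two-tailed 90% confidence level

# Sensitivity of n to the target relative error and to population variability.
results = {}
for cv in (0.25, 0.50):
    for rel_error in (0.05, 0.10, 0.15):
        n = required_sample_size(rel_error, z_90, cv, N)
        results[(cv, rel_error)] = n
        print(f"CV = {cv:.2f}  RE = {rel_error:.2f}  ->  n = {n:7.1f}")
```

Tabulating or plotting `results` against RE for each CV gives the kind of tradeoff curve that Fig. 4.18 is described as showing.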



Fig. 4.18 Size of random sample needed to achieve different
relative errors of the population mean for two different values
of population variability (CV of 25% and 50%, Example 4.7.1)

4.7.4 Stratified Sampling for Variance Reduction

Variance reduction techniques are a special type of sample
estimating procedures which rely on the principle that prior
knowledge about the structure of the model and the properties
of the input can be used to increase the precision of
estimates for a fixed sample size, or, conversely, decrease the
sample size required to obtain a fixed degree of precision.
These techniques distort the original problem so that special
techniques can be used to obtain the desired estimates at a
lower cost.

Variance can be decreased by considering a larger sample
size, which involves more work. So the effort with which
a parameter is estimated can be evaluated as: efficiency =
(variance × work)^-1. This implies that a reduction in variance
is not worthwhile if the work needed to achieve it is
excessive. A common recourse among social scientists to increase
efficiency is to use stratified sampling, which counts as a
variance reduction technique. In stratified sampling, the
distribution function to be sampled is broken up into several
pieces, each piece is then sampled separately, and the results
are later combined into a single estimate. The specification
of the strata to be used is based on prior knowledge about
the characteristics of the population to be sampled. Often an
order of magnitude variance reduction is achieved by
stratified sampling as compared to the standard random sampling
approach.

Example 4.7.2:¹⁰ Example of stratified sampling for
variance reduction

Suppose a home improvement center wishes to estimate the
mean annual expenditure of its local residents in the
hardware section and the drapery section. It is known that the
expenditures by women differ more widely than those by
men. Men visit the store more frequently and spend annually
approximately $50; expenditures of as much as $100
or as little as $25 per year are found occasionally. Annual
expenditures by women can vary from nothing to over $500.
The variance for expenditures by women is therefore much
greater, and the mean expenditure more difficult to estimate.
Assume that 80% of the customers are men and that a
sample size of 15 is to be taken. If simple random sampling
were employed, one would expect the sample to consist of
approximately 12 men (80% of 15) and 3 women. However,
assume that a sample that included 5 men and 10 women
was selected instead (more women have been preferentially
selected because their expenditures are more variable).
Suppose the annual expenditures of the members of the sample
turned out to be:

Men: 45, 50, 55, 40, 90

Women: 80, 50, 120, 80, 200, 180, 90, 500, 320, 75

It is intuitively clear that such data will lead to a more
accurate estimate of the overall average than would the
expenditures of 12 men and 3 women.

The appropriate weights must be applied to the original
sample data if one wishes to deduce the overall mean. Thus,
if $M_i$ and $W_i$ are used to designate the i-th sample of men
and women, respectively,

$$\bar{x} = \frac{1}{15} \left[ \frac{0.80}{0.33} \sum_{i=1}^{5} M_i + \frac{0.20}{0.67} \sum_{i=1}^{10} W_i \right] = \frac{1}{15} \left[ \frac{0.80}{0.33} (280) + \frac{0.20}{0.67} (1695) \right] = \$79$$

where 0.80 and 0.20 are the original weights in the
population, and 0.33 and 0.67 the sample weights respectively.

This value is likely to be a more realistic estimate than if
the sampling had been done based purely on the percentage
of the gender of the customers. The above example is a simple
case of stratified sampling where the customer base was
first stratified into the two genders, and then these were
sampled disproportionately. There are statistical formulae which
suggest near-optimal size of selecting stratified samples, for
which the interested reader can refer to Devore and Farnum
(2005) and other texts.

¹⁰ From Hines and Montgomery (1990) by permission of John Wiley
and Sons.
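
The reweighting used in Example 4.7.2 can be sketched in code. The expenditure data and the 80%/20% population shares are those of the example; the variable names are assumptions introduced here. The sketch uses the exact sample shares (5/15 and 10/15) rather than the rounded 0.33 and 0.67, so the result differs slightly in the decimals but rounds to the same dollar figure.

```python
from statistics import fmean

# Annual expenditures ($) of the sampled customers in Example 4.7.2.
men = [45, 50, 55, 40, 90]
women = [80, 50, 120, 80, 200, 180, 90, 500, 320, 75]

pop_share = {"men": 0.80, "women": 0.20}          # shares in the customer population
n_total = len(men) + len(women)
sample_share = {"men": len(men) / n_total,        # shares in the (disproportionate) sample
                "women": len(women) / n_total}

# Reweight each stratum by (population share / sample share), as in the text.
weighted_mean = (
    pop_share["men"] / sample_share["men"] * sum(men)
    + pop_share["women"] / sample_share["women"] * sum(women)
) / n_total

# Equivalent form: population-share-weighted average of the stratum means.
stratum_form = pop_share["men"] * fmean(men) + pop_share["women"] * fmean(women)

# For comparison: the naive mean that ignores the disproportionate sampling.
unweighted_mean = (sum(men) + sum(women)) / n_total

print(f"stratified (reweighted) mean = ${weighted_mean:.1f}")
print(f"unweighted sample mean       = ${unweighted_mean:.1f}")
```

The unweighted mean is pulled strongly toward the women's expenditures because women are over-represented in the sample; the reweighting restores the population proportions.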



4.8 Resampling Methods 

4.8.1 Basic Concept and Types of Methods 

The precision of a population related estimator can be impro- 
ved by drawing multiple samples from the population, and 
inferring the confidence limits from these samples rather than 
determining them from classical analytical estimation formu- 
lae based on a single sample only. However, this is infeasible 
in most cases because of the associated cost and time of as- 
sembling multiple samples. The basic rationale behind resam- 
pling methods is to draw one sample, treat this original samp- 
le as a surrogate for the population, and generate numerous 
sub-samples by simply resampling the sample itself. Thus, re- 
sampling refers to the use of given data, or a data generating 
mechanism, to produce new samples from which the required 
estimator can be deduced numerically. It is obvious that the 
sample must be unbiased and be reflective of the population 
(which it will be if the sample is drawn randomly), otherwise 
the precision of the method is severely compromised. 

Efron and Tibshirani (1982) have argued that given the
available power of computing, one should move away from
the constraints of traditional parametric theory with its
over-reliance on a small set of standard models for which
theoretical solutions are available, and substitute computational
power for theoretical analysis. This parallels the manner in
which numerical methods have in large part replaced
closed-form solution techniques in almost all fields of engineering
mathematics. Thus, versatile numerical techniques allow
one to overcome such problems as the lack of knowledge 
of the probability distribution of the errors of the variables, 
and even determine sampling distributions of such quanti- 
ties as the median or of the inter-quartile range or even the 
5th and 95th percentiles for which no traditional tests exist. 
The methods are conceptually simple, requiring low levels 
of mathematics, while the needed computing power is easi- 
ly provided by present-day personal computers. Hence, they 
are becoming increasingly popular and are used to comple- 
ment classical/traditional parametric tests. 

Resampling methods can be applied to diverse problems 
(Good 1999): (i) for determining probability in complex si- 
tuations, (ii) to estimate confidence levels of an estimator 
during univariate sampling of a population, (iii) hypothesis 
testing to compare estimators of two samples, (iv) to esti- 
mate confidence bounds during regression, and (v) for clas- 
sification. These problems can all be addressed by classical 
methods provided one makes certain assumptions regarding 
probability distributions of the random variables. The
analytic solutions can be daunting to those who use these
statistical analytic methods rarely, and one can even select the
wrong formula by error. Resampling is much more intuitive
and provides a way of simulating the physical process
without having to deal with the sometimes obfuscating
statistical constraints of the analytic methods. Resampling
methods are based on a direct extension of ideas from
statistical mathematics and have a sound mathematical theory.
A big virtue of resampling methods is that they extend
classical statistical evaluation to cases which cannot be
dealt with mathematically.

The downside to the use of these methods is that they
require large computing resources (of the order of 1000 or
more samples). This issue is no longer a constraint because
of the computing power of modern-day personal computers.
Resampling methods are also referred to as computer-inten- 
sive methods, though other techniques discussed in Sect. 10.6 
are more often associated with this general appellation. It has 
been suggested that one should use a parametric test when 
the samples are large, say number of observations is greater 
than 40, or when they are small (<5) (Good 1999). The re- 
sampling provides protection against violation of parametric 
assumptions. 

The creation of multiple sub-samples from the original
sample can be done in several ways, and this distinguishes
one method from another. The three most common
resampling methods are:

(a) Permutation method (or randomization method) is one
where all possible subsets of r items (which is the
sub-sample size) out of the total n items (the sample size)
are generated, and used to deduce the population
estimator and its confidence levels or its percentiles. This
may require some effort in many cases, and so, an
equivalent and less intensive variant of this method is to use
only a sample of all possible subsets. The size of the
sample is selected based on the accuracy needed, and
about 1000 samples are usually adequate.
The use of the permutation method when making
inferences about the medians of two populations is
illustrated below. The null hypothesis is that there is no
difference between the two populations. First, one
samples both populations to create two independent random
samples. The difference in the medians between both
samples is computed. Next, two subsamples without
replacement are created from the pooled observations of the
two samples, and the difference in the medians between
both resampled groups recalculated. This is done a large
number of times, say 1000 times. The resulting
distribution contains the necessary information regarding the
statistical confidence in the null hypothesis of the
parameter being evaluated. For example, if the difference in
the median between the two original samples was
exceeded in only 50 of the 1000 subsamples, then one
concludes that the one-tailed probability of the original
event was only 0.05. It is clear that such a sampling
distribution can be generated for any statistic of interest,
not just the median. However, the number of
randomizations quickly becomes very large, and so one has to
select the number of randomizations with some care.

(b) The jackknife method creates subsamples without
replacement. The jackknife method, introduced by Quenouille
in 1949 and later extended by Tukey in 1958, is a
technique of universal applicability that allows confidence
intervals of an estimate calculated from a sample to be
determined while reducing bias of the estimator.
There are several numerical schemes for implementing
the jackknife. One version is to: (i) divide the
random sample of n observations into g groups of equal
size, (ii) omit one group at a time and determine what
are called pseudo-estimates from the (g - 1) groups, (iii)
estimate the actual confidence intervals of the
parameters. A more widespread method of implementation is
to simply create n subsamples with (n - 1) data points
wherein a single different observation is omitted in each
subsample.

(c) The bootstrap method (popularized by Efron in 1979) is 
similar but differs in that no groups are formed but the 
different sets of data sequences are generated by simply 
sampling with replacement from the observational data 
set (Davison and Hinkley 1997). Individual estimators 
deduced from such samples permit estimates and
confidence intervals to be determined. The analyst has to
select the number of randomizations while the sample
size is selected to be equal to that of the original sample.
The method would appear to be circular, i.e., how can
one acquire more insight by resampling the same
sample? The simple explanation is that "the population is to
the sample as the sample is to the bootstrap sample".
Though the jackknife is a viable method, it has been
supplanted by the bootstrap method which has emerged
as the most efficient of the resampling methods in that
better estimates of standard errors and confidence limits
are obtained. Several improvements to the naive
bootstrap have been proposed (such as the bootstrap-t
method) especially for long-tailed distributions or for time
series data. The bootstrap method will be discussed at
more length in Sect. 4.8.3 and Sect. 10.6.
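
Two of the schemes listed above, the sampled permutation test on medians and the leave-one-out jackknife, can be sketched as follows. This is a minimal illustration under stated assumptions: the two data sets `group_a` and `group_b` are hypothetical, as are the function names, and the permutation test pools both samples and reshuffles them, which is one common way of creating the two subsamples without replacement.

```python
import random
import statistics


def permutation_test_median(a, b, n_resamples=1000, seed=0):
    """One-tailed Monte Carlo permutation test for median(a) - median(b)."""
    rng = random.Random(seed)
    observed = statistics.median(a) - statistics.median(b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # regroup the pooled data without replacement
        diff = (statistics.median(pooled[:len(a)])
                - statistics.median(pooled[len(a):]))
        if diff >= observed:
            count += 1
    # Fraction of regroupings at least as extreme as the observed difference.
    return observed, count / n_resamples


def jackknife_se(data, statistic):
    """Leave-one-out jackknife standard error of `statistic` on `data`."""
    n = len(data)
    # n subsamples of (n - 1) points, each omitting a different observation.
    loo = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = sum(loo) / n
    return ((n - 1) / n * sum((v - loo_mean) ** 2 for v in loo)) ** 0.5


# Hypothetical measurements from two populations (illustration only).
group_a = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5, 16.0, 14.4]
group_b = [11.0, 12.2, 11.8, 12.9, 11.5, 12.4, 13.1, 11.9]

observed_diff, p_value = permutation_test_median(group_a, group_b)
se_mean = jackknife_se(group_a, statistics.fmean)

print(f"observed difference in medians = {observed_diff:.2f}")
print(f"one-tailed permutation p-value = {p_value:.3f}")
print(f"jackknife SE of mean(group_a)  = {se_mean:.3f}")
```

For the sample mean, the jackknife standard error reduces algebraically to the familiar s/√n, which provides a convenient check on the implementation.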



4.8.2 Application to Probability Problems

How resampling methods can be used for solving
probability-type problems is illustrated below (Simon 1992).
Consider a simple example, where one has six balls labeled
1 to 6. What is the probability that three balls will be picked
such that they have 1, 2, 3 in that order if this is done with
replacement? The traditional probability equation would
yield $(1/6)^3$. The same result can be determined by simulating
the 3-ball selection a large number of times. This approach,
though tedious, is more intuitive since this is exactly what
the traditional probability of the event is meant to represent;
namely, the long-run frequency. One could repeat this
3-ball selection, say, a million times, count the number of
times one gets 1, 2, 3 in sequence, and from there infer the
needed probability. The procedure rules or the sequence of
operations of drawing samples have to be written in
computer code, after which the computer does the rest. Much
more difficult problems can be simulated in this manner,
and its advantages lie in its versatility, the low level of
mathematics required, and most importantly, its direct bearing
on the intuitive interpretation of probability as the
long-run frequency.



4.8.3 Application of Bootstrap to Statistical
Inference Problems

The use of the bootstrap method in two types of instances is
illustrated: determining confidence intervals and correlation
analysis. At its simplest, the algorithm of the bootstrap
method consists of the following steps (Devore and Farnum
2005):

1. Obtain a random sample of size n from the population.

2. Generate a random sample of size n with replacement
from the original sample in step 1.

3. Calculate the statistic of interest for the sample in step 2.

4. Repeat steps 2 and 3 a large number of times to form an
approximate sampling distribution of the statistic.

It is important to note that bootstrapping requires that
sampling be done with replacement, and about 1000
samples are required. It is advised that the analyst perform a few
evaluations with different numbers of samples in order to be
more confident about the results. The following example
illustrates the implementation of the bootstrap method.

Example 4.8.1:¹¹ Using the bootstrap method for deducing
confidence intervals

The data in Table 4.12 corresponds to the breakdown voltage
(in kV) of an insulating liquid which is indicative of its
dielectric strength. Determine the 95% CL interval of the mean.

Table 4.12 Data table for Example 4.8.1

62  50  53  57  41  53  55  61
59  64  50  53  64  62  50  68
54  55  57  50  55  50  56  55
46  55  53  54  52  47  47  55
57  48  63  57  57  55  53  59
53  52  50  55  60  50  56  58

First, use the large sample confidence interval formula
to estimate the 95% CL interval of the mean. Summary
quantities are: sample size n = 48, $\sum x_i = 2626$ and
$\sum x_i^2 = 144,950$, from which $\bar{x} = 54.7$ and standard
deviation s = 5.23. The 95% CL interval is then:

$$54.7 \pm 1.96 \cdot \frac{5.23}{\sqrt{48}} = 54.7 \pm 1.5 = (53.2, 56.2)$$


The confidence intervals using the bootstrap method are now
recalculated in order to evaluate differences. A histogram
of 1000 samples of n = 48 each, drawn with replacement, is
shown in Fig. 4.19. The 95% confidence intervals
correspond to the two-tailed 0.05 significance level. Thus, one
selects 1000(0.05/2) = 25 units from each end of the
distribution, i.e., the 25th and the 975th values when the
bootstrap means are ranked in increasing order, which yield
(53.2, 56.1), very close to the parametric range determined
earlier. This example illustrates the fact that bootstrap
intervals usually agree with traditional parametric ones when
all the assumptions underlying the latter are met. It is when
they do not, that the power of the bootstrap stands out. ■

Fig. 4.19 Histogram of bootstrap sample means with 1000 samples
(Example 4.8.1)

¹¹ From Devore and Farnum (2005) by permission of Cengage Learning.
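
Both resampling computations described in this part of the chapter, the ball-drawing probability of Sect. 4.8.2 and the bootstrap interval of Example 4.8.1, lend themselves to a short simulation. The sketch below is illustrative only: the seed, the number of simulated draws and the variable names are assumptions, and because the resamples are random the bootstrap limits it prints will be close to, but not identical with, those quoted in the example.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed (an assumption) for repeatability

# Part 1 (Sect. 4.8.2): probability of drawing balls 1, 2, 3 in that order,
# with replacement, from six balls labeled 1 to 6.
trials = 200_000
hits = sum(
    1 for _ in range(trials)
    if [rng.randint(1, 6) for _ in range(3)] == [1, 2, 3]
)
p_sim = hits / trials
p_exact = (1 / 6) ** 3

# Part 2 (Example 4.8.1): percentile bootstrap interval for the mean
# breakdown voltage (kV); the 48 readings are those of Table 4.12.
voltage = [
    62, 50, 53, 57, 41, 53, 55, 61,
    59, 64, 50, 53, 64, 62, 50, 68,
    54, 55, 57, 50, 55, 50, 56, 55,
    46, 55, 53, 54, 52, 47, 47, 55,
    57, 48, 63, 57, 57, 55, 53, 59,
    53, 52, 50, 55, 60, 50, 56, 58,
]
n = len(voltage)
n_boot = 1000

# Steps 2-4 of the algorithm: resample with replacement, compute the mean, repeat.
boot_means = sorted(
    statistics.fmean(rng.choices(voltage, k=n)) for _ in range(n_boot)
)

# 25th and 975th ordered values bracket the central 95% of the bootstrap means.
lower, upper = boot_means[24], boot_means[974]

print(f"simulated P(1,2,3) = {p_sim:.5f}   exact = {p_exact:.5f}")
print(f"sample mean = {statistics.fmean(voltage):.2f} kV")
print(f"bootstrap 95% interval = ({lower:.1f}, {upper:.1f}) kV")
```

Rerunning with a different seed or a larger `n_boot` is a simple way to carry out the advice of evaluating the method with different numbers of samples.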

The following example illustrates the versatility of the 
bootstrap method for determining correlation between two 
variables, a problem which is recast as comparing two samp- 
le means. 

Example 4.8.2:¹² Using the bootstrap method with a
nonparametric test to ascertain correlation of two variables
One wishes to determine whether there exists a correlation
between athletic ability and intelligence level. A sample of
10 high school athletes was obtained involving their athletic
and I.Q. scores. The data is listed in terms of descending
order of athletic scores in the first two columns of Table 4.13.
A nonparametric approach is adopted to solve this
problem. The athletic scores and the I.Q. scores are rank ordered
from 1 to 10 as shown in the last two columns of the table.
The two observations (athletic rank, I.Q. rank) are treated
together since one would like to determine their joint
behavior. The table is split into two groups of five "high" and five
"low". An even split of the group is advocated since it uses
the available information better and usually leads to better
"efficiency". The sum of the observed I.Q. ranks of the five
top athletes = (3 + 1 + 7 + 4 + 2) = 17. The resampling scheme
will involve numerous trials where a subset of 5 numbers is
drawn randomly from the set {1 ... 10}. One then adds these
five numbers for each individual trial. If the sum across
trials is consistently higher than 17, this will indicate
that the best athletes are unlikely to have earned the
observed I.Q. ranks purely by chance. The probability can be
directly estimated from the proportion of trials whose sum
exceeded 17.
Figure 4.20 depicts the histogram of the sum of 5 random 



Table 4.13 Data table for Example 4.8.2 along with ranks

Athletic score   I.Q. score   Athletic rank   I.Q. rank
97               114          1               3
94               120          2               1
93               107          3               7
90               113          4               4
87               118          5               2
86               101          6               8
86               109          7               6
85               110          8               5
81               100          9               9
76                99          10              10



Example 4.8.2 is from Simon (1992) by © permission of Duxbury Press.



observations using 100 trials (a rather low number of trials 
meant for illustration purposes). Note that in only 2% of the 
trials was the sum 17 or lower. Hence, one can state to within 
98% confidence, that there does exist a correlation between 
athletic ability and I.Q. level. ■ 
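
The resampling scheme of Example 4.8.2 can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure used to produce Fig. 4.20: the function names, the seed, and the choice of drawing the five ranks without replacement (taken here as the meaning of "a subset of 5 numbers") are assumptions. Because there are only 252 possible subsets, the exact proportion can also be enumerated for comparison; it works out to 4/252, or about 1.6%, which is consistent with the 2% found from 100 trials.

```python
import random
from itertools import combinations

RANKS = list(range(1, 11))           # I.Q. ranks 1..10 (Table 4.13)
OBSERVED_SUM = 3 + 1 + 7 + 4 + 2     # I.Q. ranks of the five top athletes


def simulated_proportion(n_trials, seed=0):
    """Fraction of random 5-rank subsets whose sum is <= the observed sum."""
    rng = random.Random(seed)        # seeded so that the run is repeatable
    hits = 0
    for _ in range(n_trials):
        trial_sum = sum(rng.sample(RANKS, 5))   # draw without replacement
        if trial_sum <= OBSERVED_SUM:
            hits += 1
    return hits / n_trials


def exact_proportion():
    """Same fraction, obtained by enumerating all C(10, 5) = 252 subsets."""
    sums = [sum(subset) for subset in combinations(RANKS, 5)]
    return sum(s <= OBSERVED_SUM for s in sums) / len(sums)


if __name__ == "__main__":
    print("simulated:", simulated_proportion(100))
    print("exact    :", round(exact_proportion(), 4))
```

With only 100 trials the simulated proportion fluctuates noticeably from run to run, which is why the text describes 100 as a rather low number of trials.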



Problems 

Pr. 4.1 The specification to which solar thermal collectors 
are being manufactured requires that their lengths be between 
8.45-8.65 feet and their width between 1.55-1.60 ft. The 
modules produced by a certain assembly line have lengths 
that are normally distributed about a mean of 8.56 ft with 
standard deviation 0.05 ft, and widths also normally distri- 
buted with a mean of 1.58 ft with standard deviation 0.01 ft. 
For the modules produced by this assembly line, find: 

(a) the % that will not be within the specified limits for 
length; state the implicit assumption in this approach 

(b) the % that will not be within the specified limits for 
width; state the implicit assumption in this approach 

(c) the % that will not meet the specifications; state the im- 
plicit assumption in this approach. 

Pr. 4.2 The pH of a large lake is to be determined for which 
purpose 9 test specimens were collected and tested to give: 
{6.0, 5.7, 5.8, 6.5, 7.0, 6.3, 5.6, 6.1, 5.0}. 

(a) Calculate the mean pH for the 9 specimens 

(b) Find an unbiased estimate of the standard deviation of 
the population of all pH samples 

(c) Find the 95% CL interval for the mean of this popula- 
tion if it is known from past experience that the pH va- 
lues have a standard deviation of 0.6. State the implicit 
assumption in this approach 

(d) Find the 95% CL interval for the mean of this popula- 
tion if no previous information about pH value is avai- 
lable. State the implicit assumption in this approach. 

Pr. 4.3 Two types of cement brands A and B were evaluated 
by testing the compressive strength of concrete blocks made 
from them. The results for 7 test pieces for A and 5 for B are 
shown in Table 4.14. 

(a) Estimate the mean difference between both types of 
concrete 

(b) Estimate the 95% confidence limits of the difference of 
both types of concrete 

(c) Repeat the above using the bootstrap method with 1000 
samples and compare results. 

Table 4.14 Data table for Problem 4.3

Brand A   2.18   3.17   2.46   2.70   2.78   3.35   3.52
Brand B   3.13   3.07   3.92   3.51   2.92

Fig. 4.20 Histogram based on 100 trials of the sum of 5 random
ranks from the sample of 10. Note that in only 2% of the trials
was the sum equal to 17 or lower (Example 4.8.2). Horizontal
axis: sum of 5 ranks

Pr. 4.4 Using classical and Bayesian approaches to verify
claimed benefit of gasoline additive

An inventor claims to have developed a gasoline additive
which increases gas mileage of cars. He specifically states
that tests using a specific model and make of car resulted
in an increase of 50 miles per filling. An independent testing
group repeated the tests on six identical cars and obtained the
results shown in Table 4.15.

(a) Assuming normal distribution, perform parametric tests
at the 95% CL to verify the inventor's claim

(b) Repeat the problem using the bootstrap method with
1000 samples and compare results.

Table 4.15 Data table for Problem 4.4

Regular         395   420   405   417   399   410
With additive   440   436   447   453   444   426

Pr. 4.5" Analyzing distribution of radon concentration in 
homes 

The Environmental Protection Agency (EPA) determined 
that an indoor radon concentration in homes of 4 pCi/L was 
acceptable though there is an increased cancer risk level for 
humans of 10 ''. Indoor radon concentrations in 43 residen- 
ces were measured randomly, as shown in Table 4.16. 

The Binomial distribution is frequently used in risk as- 
sessment since only two states or outcomes can exist: either 
one has cancer or one does not. 

(a) Determine whether the normal or the t-distributions
better represent the data

(b) Compute the standard deviation

(c) Use the one-tailed test to evaluate whether the mean value
is less than the threshold value of 4 pCi/L at the 90% CL

(d) At what confidence level can one state that the true
mean is less than 6 pCi/L?

(e) Compute the range for the 90% confidence intervals.

Table 4.16 Data table for Problem 4.5 (43 measurements, in the
order listed)

4.04   4.38   2.90   4.47   2.73   0.74   4.60   5.05
2.87   1.72   3.08   4.01   5.73   4.04   6.48   3.26
3.25   1.22   5.39   3.48   3.74   6.01   6.08   2.37
5.25   3.99   3.40   5.15   5.39   1.80   0.89   3.96
2.73   4.60   4.93   3.72   2.82   5.87   5.05   3.83
3.51   3.41   3.77

Problem 4.5 is from Kammen and Hassenzahl (1999) by © permission
of Princeton University Press.



Pr. 4.6 Table 4.17 assembles radon concentrations in pCi/L 
for U.S. homes. Clearly it is not a normal distribution. Re- 
searchers have suggested using the lognormal distribution or 
the power law distribution with exponent of 0.25. Evaluate 
which of these two functions is more appropriate: 

(a) graphically using quantile plots 

(b) using appropriate statistical tests for distributions. 

Table 4.17 Data table for Problem 4.6

Concentration    % homes    Concentration    % homes
level (pCi/L)               level (pCi/L)
0.25             16         2.75             2
0.50             18         3.00             3
0.75             13         3.25             1
1.00             10         3.50             2
1.25              8         3.75             1
1.50              6         4.00             2
1.75              5         4.25             2
2.00              4         4.50             1
2.25              3         4.75             1
2.50              2         5.00

Pr. 4.7 Using survey sample to determine proportion of
population in favor of off-shore wind farms

A study is initiated to estimate the proportion of residents
in a certain coastal region who do not favor the construction
of an off-shore wind farm. The state government decides that
if the fraction of those against wind farms at the 95% CL
drops to less than 0.50, then the wind farm permit will likely
be granted.

(a) A survey of 200 residents at random is taken from
which 90 state that they are not in favor. Based on this
data, would the permit be granted or not?

(b) If the survey size is increased, what factors could
intervene which could result in a reversal of the above
course of action?

(c) A major public relations campaign is initiated by the
wind farm company in an effort to sway public opinion
in their favor. After the campaign, a new sample of 100
residents at random was taken, and now only 30 stated
that they were not in favor. Did the fraction of residents
change significantly at the 95% CL from the previous
fraction?

(d) Would a permit be likely to be granted in this case?

Table 4.19 Data table for Problem 4.9

Test #    1    2    3    4    5    6    7    8
Fan A    55   52   51   59   60   56   54   54
Fan B    46   55   59   50   47   62   53   55

Pr. 4.8 Using ANOVA to evaluate pollutant concentration
levels at different times of day

The transportation department of a major city is concerned
with elevated air pollution levels during certain times of the
day at some key intersections. Samples of SO₂ (in μg/m³)
are taken at three locations during three different times of the
day as shown in Table 4.18.

(a) Conduct an ANOVA test to determine whether the mean
concentrations of SO₂ differ during the three collection
periods at α = 0.05

(b) Create an effects plot of the data

(c) Use Tukey's multiple comparison procedure to determine
which collection periods differ from one another.



Pr. 4.9 Using non-parametric tests to identify the better of
two fan models

The facility manager of a large campus wishes to replace the
fans in the HVAC system of his buildings. He narrows down
the possibilities to two manufacturers and wishes to use the
Wilcoxon rank sum test at significance level α = 0.05 to identify
the better fan manufacturer based on the number of hours of
operation prior to servicing. Table 4.19 assembles such data (in
hundreds of hours) generated by an independent testing agency.

Pr. 4.10 Parametric test to evaluate relative performance of
two PV systems from sample data

A consumer advocate group wishes to evaluate the performance
of two different types of photovoltaic (PV) panels
which are very close in terms of rated performance and cost.
They convince a builder of new homes to install 2 panels
of each brand on two homes in the same locality with care
taken that their tilt and orientation towards the sun are
identical. The test protocol involves monitoring these two PV
panels for 15 weeks and evaluating the performance of the
two brands based on their weekly total electrical output. The
weekly total electrical output in kWh is listed in Table 4.20.
The monitoring equipment used is identical in both locations
and has an absolute error of 3 kWh/week at 95% uncertainty
level. Evaluate using parametric tests whether the two
brands are different at a significance level of α = 0.05 with
measurement errors being explicitly considered.

Pr. 4.11 Comparing two instruments using parametric,
non-parametric and bootstrap methods

A pyranometer meant to measure global solar radiation is
being cross-compared with a primary reference instrument.
Several simultaneous observations (in kW/m²) were taken
with both instruments deployed side by side as shown in
Table 4.21. Determine, at a significance level α = 0.05, whether
the secondary field instrument differs from the primary
based on:

(a) Parametric tests

(b) Non-parametric tests

(c) The bootstrap method with a sample size of 1000.



Table 4.18 Data table for Problem 4.8

Collection time   Location A   Location B   Location C
7 am              50           80           62
Noon              45           52           48
6 pm              57           74           68



Pr. 4.12 Repeat Example 4.2.1 using the Bayesian approach
assuming:

(a) the sample of 36 items tested have a mean of 15 years
and a standard deviation of 2.5 years

(b) the same mean and standard deviation but the sample
consists of 9 items only.

Table 4.20 Weekly electrical output in kWh (Problem 4.10)

Week   Brand A   Brand B
 1     197       189
 2     202       199
 3     148       142
 4     246       248
 5     173       176
 6     203       187
 7     165       160
 8     121       115
 9     146       138
10     189       173
11     174       170
12     225       218
13     242       232
14     206       213
15     197       193



Table 4.21 Data table for Problem 4.11

Observation   Reference   Secondary
 1            0.96        0.93
 2            0.82        0.78
 3            0.75        0.76
 4            0.61        0.64
 5            0.77        0.74
 6            0.89        0.91
 7            0.64        0.62
 8            0.81        0.77
 9            0.68        0.63
10            0.65        0.62
11            0.84        0.81
12            0.59        0.55
13            0.94        0.87
14            0.91        0.86





Pr. 4.13 An electrical motor company states that one of
their product lines of motors has a mean life of 8100 h with
a standard deviation of 200 h. A wholesale dealer purchases
a consignment and tests 10 of the motors. The sample
mean and standard deviation are found to be 7800 h and
100 h, respectively. Assume normal distribution.
Compute:

(a) The 95% confidence interval based on the classical
approach

(b) The 95% confidence interval based on the Bayesian
approach

(c) The probability that the consignment has a mean value
less than 4000 h.

Pr. 4.14 The average cost of electricity to residential
customers during the three summer months is to be determined.
A sample of electric cost in 25 residences is collected
as shown in Table 4.22. Assume a normal distribution with
standard deviation of 80.

(a) If the prior value is a Gaussian with N(325, 80), find the
posterior distribution for the mean $\mu$

(b) Find a 95% Bayesian credible interval for $\mu$

(c) Compare the interval with that from the traditional method

(d) Perform a traditional test for $H_0: \mu = 350$ versus
$H_1: \mu \neq 350$ at the 0.05 significance level

(e) Perform a Bayesian test of the hypothesis
$H_0: \mu \le 350$ versus $H_1: \mu > 350$ at the 0.05
significance level.

Table 4.22 Data table for Problem 4.14

514   536   345   440   427
443   386   418   364   483
506   385   410   561   275
306   294   402   350   343
480   334   324   414   296

Pr. 4.15 Comparison of human comfort correlations between
Caucasian and Chinese subjects
Human indoor comfort can be characterized by the occupants'
feeling of well-being in the indoor environment.
It depends on several interrelated and complex phenomena
involving subjective as well as objective criteria. Research
initiated over 50 years back and subsequent chamber studies
have helped define acceptable thermal comfort ranges for
indoor occupants. Perhaps the most widely used standard is
ASHRAE Standard 55-2004 (ASHRAE 2004). The basis of
the standard is the thermal sensation scale determined by the
votes of the occupants following the scale in Table 4.23.

The individual votes of all the occupants are then averaged
to yield the predicted mean vote (PMV). This is one
of the two indices relevant to define acceptability of a large
population of people exposed to a certain indoor environment.
PMV = 0 is defined as the neutral state (neither cool
nor warm), while positive values indicate that occupants feel
warm, and vice versa. The mean scores from the chamber
studies are then regressed against the influential environmental
parameters so as to yield an empirical correlation
which can be used as a means of prediction:



$\mathrm{PMV} = a^{*} \cdot T_{db} + b^{*} \cdot P_{v} + c^{*}$    (4.49)

where $T_{db}$ is the indoor dry-bulb temperature (degrees C),
$P_{v}$ is the partial pressure of water vapor (kPa), and the
numerical values of the coefficients a*, b* and c* are dependent
on such factors as sex, age, hours of exposure, clothing
levels, type of activity, .... The values relevant to healthy
adults in an office setting for a 3 h exposure period are given
in Table 4.24.

Table 4.23 ASHRAE thermal sensation classes

+3     +2     +1              0         -1              -2     -3
Hot    Warm   Slightly warm   Neutral   Slightly cool   Cool   Cold

Table 4.24 Regression parameters of the ASHRAE PMV model
(Eq. 4.49) for 3 h exposure

Sex        a*      b*      c*
Male       0.212   0.293   -5.949
Female     0.275   0.255   -8.622
Combined   0.243   0.278   -6.802

Problem 4.14 is from Bolstad (2004) by © permission of John Wiley and Sons.

In general, the distribution of votes will always show
considerable scatter. The second index is the percentage of
people dissatisfied (PPD), defined as people voting outside the
range of -1 to +1 for a given value of PMV. When the PPD
is plotted against the mean vote of a large group characterized
by the PMV, one typically finds a distribution such as
that shown in Fig. 4.21. This graph shows that even under
optimal conditions (i.e., a mean vote of zero), at least 5% are
dissatisfied with the thermal comfort. Hence, because of
individual differences, it is impossible to specify a thermal
environment that will satisfy everyone. A correlation between
PPD and PMV has also been suggested:

$\mathrm{PPD} = 100 - 95 \cdot \exp\left[-\left(0.03353 \cdot \mathrm{PMV}^{4} + 0.2179 \cdot \mathrm{PMV}^{2}\right)\right]$    (4.50)

Note that the overall approach is consistent with the statistical
approach of approximating distributions by the two primary
measures, the mean and the standard deviation. However,
in this instance, the standard deviation (characterized by
PPD) has been empirically found to be related to the mean
value, namely PMV (Eq. 4.50).
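
As a quick numerical check of Eqs. 4.49 and 4.50, the two correlations can be coded directly. The sketch below is illustrative: the function and dictionary names are made up for this example, the coefficients are those of Table 4.24 (3 h exposure), and the sample condition of 25 degrees C and 1.5 kPa is a hypothetical input rather than a value from the chamber tests.

```python
import math

# Coefficients (a*, b*, c*) of Eq. 4.49 taken from Table 4.24 (3 h exposure)
PMV_COEFFS = {
    "male": (0.212, 0.293, -5.949),
    "female": (0.275, 0.255, -8.622),
    "combined": (0.243, 0.278, -6.802),
}


def pmv(t_db, p_v, group="combined"):
    """Predicted mean vote, Eq. 4.49; t_db in degrees C, p_v in kPa."""
    a, b, c = PMV_COEFFS[group]
    return a * t_db + b * p_v + c


def ppd(pmv_value):
    """Percentage of people dissatisfied, Eq. 4.50, as a function of PMV."""
    exponent = -(0.03353 * pmv_value ** 4 + 0.2179 * pmv_value ** 2)
    return 100.0 - 95.0 * math.exp(exponent)


if __name__ == "__main__":
    vote = pmv(25.0, 1.5, group="male")   # hypothetical chamber condition
    print(f"PMV = {vote:.3f}, PPD = {ppd(vote):.1f} %")
    print(f"PPD at PMV = 0: {ppd(0.0):.1f} %")   # minimum of the curve
```

Evaluating `ppd(0.0)` returns 5, which matches the statement that at least 5% of occupants remain dissatisfied even at a mean vote of zero.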

A research study was conducted in China by Jiang (2001)
in order to evaluate whether the above types of correlations,
developed using American and European subjects, are
applicable to Chinese subjects as well. The environmental
chamber test protocol was generally consistent with previous
Western studies. The total number of Chinese subjects
in the pool was about 200, and several tests were done with
smaller batches (about 10-12 subjects per batch evenly split
between males and females). Each batch of subjects first
spent some time in a pre-conditioning chamber after which
they were moved to the main chamber. The environmental
conditions (dry-bulb temperature $T_{db}$, relative humidity RH
and air velocity) of the main chamber were controlled such
that: $T_{db}$ (±0.3°C), RH (±5%) and air velocity < 0.15 m/s.
The subjects were asked to vote about every ½ h over 2½ h
in accordance with the 7-point thermal sensation scale.
However, in this problem we consider only the data relating to
averages of the two last votes corresponding to 2 and 2½ h
since only then was it found that the voting had stabilized
(this feature of the length of exposure is also consistent with
American/European tests).

Fig. 4.21 Predicted percentage of dissatisfied (PPD) as function of
predicted mean vote (PMV) following Eq. 4.50

Three separate sets each consisting of 18 tests were performed:
one for females only, one for males only, and one for
combined. The chamber $T_{db}$, RH and the associated partial
pressure of water vapor needed in Eq. 4.49 (which can be
determined from psychrometric relations) along with the PMV and
PPD measures are tabulated in Table P4.15 (see
Appendix B). The conditions under which these tests were done
are better visualized when plotted on a psychrometric chart, as
shown in Fig. 4.22. Based on this data, one would like to determine
whether the psychological responses of Chinese people are
different from those of American/European people.

Fig. 4.22 Chamber test conditions plotted on a psychrometric chart for
Chinese subjects

The data set for Problem 4.15 was provided by Wei Jiang, for which we
are grateful.



Hint: One of the data points is suspect. Also use Eqs. 4.49 
and 4.50 to generate the values pertinent to Western subjects 
prior to making comparative evaluations. 

(a) Formulate the various different types of tests one would 
perform stating the intent of each test 

(b) Perform some or all of these tests and draw relevant 
conclusions 

(c) Prepare a short report describing your entire analysis.



5 Estimation of Linear Model Parameters Using Least Squares


This chapter deals with methods to estimate parameters of
linear parametric models using ordinary least squares (OLS).
The univariate case is first reviewed along with equations for
the uncertainty in the model estimates as well as in the model
predictions. Several goodness-of-fit indices to evaluate the
model fit are also discussed, and the assumptions inherent
in OLS are highlighted. Next, multiple linear models are
treated, and several notions specific to correlated regressors
are presented. The insights which residual analysis provides
are discussed, and different types of remedial actions for
improper model residuals are addressed. Other types of linear
models such as splines and models with indicator variables
are discussed. Finally, a real-world case study is presented in
which actual field tests were analyzed to verify whether they
supported the claim that a refrigerant additive improved chiller
thermal performance.



5.1 Introduction 

The analysis of observational data or data obtained from 
designed experiments often requires the identification of a 
statistical model or relationship which captures the underly- 
ing structure of the system from which the sample data was 
drawn. A model is a relation between the variation of one 
variable (called the dependent or response variable) against 
that of other variables (called independent or regressor va- 
riables). If observations (or data) are taken of both respon- 
se and regressor variables under various sets of conditions, 
one can build a mathematical model from this information 
which can then be used as a predictive tool under different 
sets of conditions. How to analyze the relationships among 
variables and determine a (if not "the") optimal relation, falls 
under the realm of regression model building or regression 
analysis. 

Models, as stated in Sect. 1.1, can be of different forms,
with mathematical models being of sole concern in this book.
These can be divided into:

(i) parametric models which can be a single function (or a
set of functions) capturing the variation of the response
variable in terms of the regressors. The intent is to
identify both the model function and determine the values of
the parameters of the model along with some indication
of their uncertainty; and
(ii) nonparametric models where the relationship between
response and regressors is such that a mathematical model
in the conventional sense is inadequate. Nonparametric
models are treated in Sect. 9.3 in the framework
of time series models and in Sect. 11.3.2 when dealing
with artificial neural network models.
The parameters appearing in parametric models can be es- 
timated in a number of ways, of which ordinary least squares 
(OLS) is the most common and historically the oldest. Other 
estimation techniques are described in Chap. 10. There is a 
direct link between how the model parameters are estima- 
ted and the underlying joint probability distributions of the 
variables, which is discussed below and in Chap. 10. In this 
chapter, only models linear in the parameters are addressed 
which need not necessarily be linear models (see Sect. 1.2.4 
for relevant discussion). However, often, the former are loo- 
sely referred to as linear parametric models. 



5.2 Regression Analysis 

5.2.1 Objective of Regression Analysis 

The objectives of regression analysis are: (i) to identify the
"best" model among several candidates in case the physics
of the system does not provide a unique mechanistic
relationship, and (ii) to determine the "best" values of the model
parameters; with "best" being based on some criterion yet to
be defined. Desirable properties of estimators, which are viewed
as random variables, have been described in Sect. 4.7.2,
and most of these concepts apply to parameter estimation of
regression models as well.

Fig. 5.1 Ordinary Least Squares Regression (OLS) is based on finding



5.2.2 Ordinary Least Squares

Once a set of data is available, what is the best model which
can be fit to the data? Consider the (x, y) set of n data points
shown in Fig. 5.1. The criterion for "best fit" should be
objective, intuitively reasonable and relatively easy to
implement mathematically. One would like to minimize the
deviations of the points from the prospective regression line.
The method most often used is the method of least squares
where, as the name implies, the "best fit" line is interpreted
as one which minimizes the sum of the squares of the residuals.
Since it is based on minimizing the squared deviations,
it is also referred to as the Method of Moments Estimation
(MME). The most common and widely used sub-class of
least squares is the ordinary least squares (OLS) where, as
shown in Fig. 5.1, the squared sum of the vertical differences
between the line and the observation points is minimized,
i.e., $\min (D_1^2 + D_2^2 + \cdots + D_n^2)$. Another criterion for
determining the best fit line could be to minimize the sum of
the absolute deviations, i.e., $\min (|D_1| + |D_2| + \cdots + |D_n|)$.
However, the mathematics to deal with absolute quantities
becomes cumbersome and restrictive, and that is why
historically, the method of least squares was proposed and
developed. Inferential statistics plays an important part in
regression model building because identification of the system
structure via a regression line from sample data has an
obvious parallel to inferring population mean from sample
data (discussed in Sect. 4.2.1). The intent of regression is to
capture or "explain" via a model the variation in y for different
x values. Taking a simple mean value of y (see Fig. 5.2a)
leaves a lot of the variation in y unexplained. Once a model is
fit, however, the unexplained variation is much reduced as the
regression line accounts for some of the variation that is due
to x (see Fig. 5.2b, c). Further, the assumption of normally
distributed variables, often made in inferential theory, is also
presumed for the distribution of the population of y values at
each x value (see Fig. 5.3). Here, one notes that when slices
of data are made at different values of x, the individual y
distributions are close to normal with equal variance.



5.3 Simple OLS Regression



5.3.1 Traditional Simple Linear Regression 

Let us consider a simple linear model with two parameters,
a and b, given by:

$y = a + b \cdot x$    (5.1)

The parameter a denotes the model intercept, i.e., the
value of y at x = 0, while the parameter b is the slope of the
straight line represented by the simple model (see Fig. 5.4).
The objective of the regression analysis is to determine the
numerical values of the parameters a and b which result in
the model given by Eq. 5.1 being able to best explain the variation
of y about its mean $\bar{y}$ as the numerical value of the regressor
variable x changes. Note that the slope parameter b explains
the variation in y due to that in x. It does not necessarily
follow that this parameter accounts for more of the observed
absolute magnitude in y than does the intercept parameter
term a. For any y value, the total deviation can be partitioned
into two pieces: explained and unexplained (recall the ANOVA
approach presented in Sect. 4.3 which is based on the
same conceptual approach). Mathematically,

$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$    (5.2)

or SST = SSE + SSR

where

$y_i$ is the individual response at observation i,

$\bar{y}$ the mean value of y of the n observations,

$\hat{y}_i$ the value of y estimated from the regression model for
observation i,

SST = total sum of squares,

SSE = error sum of squares or sum of the squared residuals which
reflects the variation about the regression line (similar
to Eq. 4.20 when dealing with ANOVA type of problems), and

SSR = regression sum of squares which reflects the amount
of variation in y explained by the model (similar to
treatment sum of squares of Eq. 4.19).

These quantities are conceptually illustrated in Fig. 5.4.
The sum of squares minimization implies that one wishes to
minimize SSE, i.e.

$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b \cdot x_i)^2 = \sum_{i=1}^{n} \varepsilon_i^2$    (5.3)

where $\varepsilon$ is called the model residual or error.

From basic calculus, the sum of the squared model residuals is
minimized when:

$\frac{\partial \sum \varepsilon^2}{\partial a} = 0$ and $\frac{\partial \sum \varepsilon^2}{\partial b} = 0$

The above two equations lead to the following equations
(called the normal equations):

$n a + b \sum x_i = \sum y_i$ and $a \sum x_i + b \sum x_i^2 = \sum x_i y_i$    (5.4)

where n is the number of observations. This leads to the
following expressions of the most "probable" OLS values of a
and b:

$b = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n \sum x_i^2 - (\sum x_i)^2} = \frac{S_{xy}}{S_{xx}}$    (5.5a)

$a = \frac{(\sum y_i)(\sum x_i^2) - (\sum x_i y_i)(\sum x_i)}{n \sum x_i^2 - (\sum x_i)^2} = \bar{y} - b \cdot \bar{x}$    (5.5b)

where

$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$    (5.6)

Fig. 5.2 Conceptual illustration of how regression explains or
reduces unexplained variation in the response variable.
It is important to note that the variation in the response variable
is taken in reference to its mean value. a total variation (before
regression), b explained variation (due to regression), c residual
variation (after regression)

Fig. 5.3 Illustration of normally distributed errors with equal variances
at different discrete slices of the regressor variable values. Normally
distributed errors is one of the basic assumptions in OLS regression
analysis. (From Schenck 1969 by permission of McGraw-Hill)

Fig. 5.4 The value of regression in reducing unexplained variation
in the response variable as illustrated by using a single observed
point. The total variation from the mean of the response variable
is partitioned into two portions: one that is explained by the
regression model and the other which is the unexplained deviation,
also referred to as model residual

Table 5.1 Data table for Example 5.3.1

Solids reduction   Chemical oxygen   Solids reduction   Chemical oxygen
x (%)              demand y (%)      x (%)              demand y (%)
 3                  5                36                 34
 7                 11                37                 36
11                 21                38                 38
15                 16                39                 37
18                 16                39                 36
27                 28                39                 45
29                 27                40                 39
30                 25                41                 41
30                 35                42                 40
31                 30                42                 44
31                 40                43                 37
32                 32                44                 44
33                 34                45                 46
33                 32                46                 46
34                 34                47                 49
36                 37                50                 51
36                 38





How the least squares regression model reduces the unexplained
variation in the response variable is conceptually illustrated
in Fig. 5.4.

Example 5.3.1: Water pollution model between solids reduction
and chemical oxygen demand
In an effort to determine a regression model between tannery
waste (expressed as solids reduction) and water pollution
(expressed as chemical oxygen demand), sample data (33
observation sets) shown in Table 5.1 were collected. Estimate
the parameters of a linear model.

The regression line is estimated by first calculating the
following quantities:

$\sum_{i=1}^{33} x_i = 1104$, $\sum_{i=1}^{33} y_i = 1124$, $\sum_{i=1}^{33} x_i y_i = 41{,}355$, $\sum_{i=1}^{33} x_i^2 = 41{,}086$

Subsequently Eqs. 5.5a and b are used to compute:

$b = \frac{(33)(41{,}355) - (1104)(1124)}{(33)(41{,}086) - (1104)^2} = 0.9036$

$a = \frac{1124 - (0.903643)(1104)}{33} = 3.8296$

Thus, the estimated regression line is:

$\hat{y} = 3.8296 + 0.9036 \cdot x$

The above data is plotted as a scatter plot in Fig. 5.5a. How
well the regression model performs compared to the measurements
is conveniently assessed from the observed vs predicted
plot such as Fig. 5.5b. Tighter scatter of the data points
around the regression line indicates more accurate model fit.

Fig. 5.5 a Scatter plot of data b Plot of observed versus OLS model
predicted values of the y variable

The regression line can be used for prediction purposes.
The value of y at, say, x = 50 is simply:

$\hat{y} = 3.8296 + (0.9036)(50) = 49$ ■
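
The arithmetic of Example 5.3.1 is easy to reproduce. The short sketch below evaluates Eqs. 5.5a and 5.5b from the four sums quoted in the example; the function name `ols_from_sums` is an assumption made for illustration and is not part of the text.

```python
def ols_from_sums(n, sum_x, sum_y, sum_xy, sum_xx):
    """Intercept a and slope b of y = a + b*x from the sums in Eqs. 5.5a, b."""
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denom   # Eq. 5.5a
    a = (sum_y - b * sum_x) / n                # Eq. 5.5b, i.e. y_bar - b * x_bar
    return a, b


# Sums quoted in Example 5.3.1 (33 observation sets of Table 5.1)
a, b = ols_from_sums(n=33, sum_x=1104, sum_y=1124, sum_xy=41355, sum_xx=41086)

if __name__ == "__main__":
    print(f"a = {a:.4f}, b = {b:.4f}")
    print(f"predicted y at x = 50: {a + b * 50:.1f}")
```

Working from the sums rather than the raw data keeps the calculation identical to the hand computation, at the cost of some numerical robustness when the x values are large and closely spaced.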



5.3.2 Model Evaluation 

(a) The most widely used measure of model adequacy or
goodness-of-fit is the coefficient of determination $R^2$, where
$0 \le R^2 \le 1$:

$R^2 = \frac{\text{explained variation of } y}{\text{total variation of } y} = \frac{SSR}{SST}$    (5.7a)

For a perfect fit $R^2 = 1$, while $R^2 = 0$ indicates that either the
model is useless or that no relationship exists. For a univariate
linear model, $R^2$ is identical to the square of the Pearson
correlation coefficient r (see Sect. 3.4.2). $R^2$ is a misleading statistic
if models with different numbers of regressor variables are to
be compared. Because $R^2$ does not account for the number of
degrees of freedom, it cannot but increase as additional variables
are included in the model, even if these variables have very
little explicative power.

Example 5.3.1 is from Walpole et al. (1998) by © permission of Pearson Education.

(b) A more desirable goodness-of-fit measure is the corrected or adjusted $R^2$, computed as

$$\bar{R}^2 = 1 - (1 - R^2) \cdot \frac{n - 1}{n - k} \tag{5.7b}$$

where n is the total number of observation sets, and k is the number of model parameters (for a simple linear model, k = 2).

Since $\bar{R}^2$ concerns itself with variances and not variation, this eliminates the incentive to include additional variables in a model which have little or no explicative power. Thus, $\bar{R}^2$ is the right measure to use during identification of a parsimonious^ model when multiple regressors are in contention. However, it should not be used to decide whether an intercept is to be added or not. For the intercept model, $R^2$ is the proportion of variability measured by the sum of squares about the mean which is explained by the regression. Hence, for example, $R^2 = 0.92$ would imply that 92% of the variation in the dependent variable about its mean value is explained by the model.

(c) Another widely used estimate of the magnitude of the absolute error of the model is the root mean square error (RMSE), defined as follows:

$$RMSE = \left( \frac{SSE}{n - k} \right)^{1/2} \tag{5.8a}$$

where SSE is the sum of square error defined as

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b \cdot x_i)^2 \tag{5.8b}$$

The RMSE is an absolute measure and its range is $0 \le RMSE < \infty$. Its units are the same as those of the y variable. It is also referred to as the "standard error of the estimate".

A normalized measure is often more appropriate: the coefficient of variation of the RMSE (or CVRMSE or simply CV), defined as:

$$CV = \frac{RMSE}{\bar{y}} \tag{5.8c}$$

Hence, a CV value of, say, 12% implies that the root mean value of the unexplained variation in the dependent variable y is 12% of the mean value of y.



^ Parsimony in the context of regression model building is a term deno- 
ting the most succinct model, i.e., one without any statistically super- 
fluous regressors. 



Note that the CV defined thus is based on absolute errors. Hence, it tends to place less emphasis on deviations between model predictions and observations which occur at lower numerical values of y than at the high end. Consequently, the measure may inadequately represent the goodness of fit of the model over the entire range of variation under certain circumstances. An alternative definition of CV based on relative mean deviations is:

$$CV^* = \left[ \frac{1}{(n - k)} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2 \right]^{1/2} \tag{5.8d}$$

If the CV and CV* indices differ appreciably for a particular model, this would suggest that the model may be inadequate at the extreme range of variation of the response variable. Specifically, if CV* > CV, this would indicate that the model deviates more at the lower range, and vice versa.

(d) The mean bias error (MBE) is defined as the mean difference between the actual data values and model predicted values:

$$MBE = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)}{n - k} \tag{5.9a}$$

Note that when a model is identified by OLS, the model MBE of the original set of regressor variables used to identify the model should be zero (to within round-off errors of the computer). Only when, say, the model identified from a first set of observations is used to predict the value of the response variable under a second set of conditions will MBE be different from zero. Under such circumstances, the MBE is also called the mean simulation or prediction error. A normalized MBE (or NMBE) is often used, and is defined as the MBE given by Eq. 5.9a divided by the mean value $\bar{y}$:

$$NMBE = \frac{MBE}{\bar{y}} \tag{5.9b}$$



Competing models can be evaluated based on their CV and NMBE values; models with low CV and NMBE values are preferred. Under certain circumstances, one model may be preferable to another in terms of one index but not the other. The analyst is then perplexed as to which index to pick as the primary one. In such cases, the specific intent of how the model is going to be subsequently applied should be considered, which may suggest the model selection criterion.
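The indices discussed above (RMSE, CV, CV*, MBE and NMBE) are simple functions of the observed and predicted values. The following is a minimal sketch, not a definitive implementation: the function name `fit_indices` and the four-point data set are hypothetical, and the sketch assumes that every index is divided by the degrees of freedom df = n - k, as for the internal indices described in the text.

```python
import math


def fit_indices(y_obs, y_pred, k):
    """Goodness-of-fit indices for a model with k parameters (df = n - k assumed)."""
    n = len(y_obs)
    df = n - k
    resid = [yo - yp for yo, yp in zip(y_obs, y_pred)]
    y_mean = sum(y_obs) / n
    rmse = math.sqrt(sum(r * r for r in resid) / df)
    # CV* uses relative deviations, so errors at low y values weigh more
    cv_star = math.sqrt(sum((r / yo) ** 2 for r, yo in zip(resid, y_obs)) / df)
    mbe = sum(resid) / df
    return {
        "RMSE": rmse,
        "CV": rmse / y_mean,
        "CV_star": cv_star,
        "MBE": mbe,
        "NMBE": mbe / y_mean,
    }


# Hypothetical observations and model predictions, for illustration only
y_obs = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
idx = fit_indices(y_obs, y_pred, k=2)
print(idx)
```

In this made-up case CV* exceeds CV, which by the rule stated earlier would point to larger relative deviations at the lower range of y.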

While fitting regression models, there is the possibility of 
"overfitting", i.e., the model fits part of the noise in the data 
along with the system behavior. In such cases, the model is 
likely to have poor predictive ability which often the analyst 
is unaware of. A statistical index is defined later (Eq. 5.42) 
which can be used to screen against this possibility. A better 
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way to minimize this effect is to randomly partition the data 
set into two (say, in proportion of 80/20), use the 80% por- 
tion of the data to develop the model, calculate the internal 
predictive indices CV and NMBE (following Eqs. 5.8c and 
5.9b), use the 20% portion of the data and predict the y va- 
lues using the already identified model, and finally calculate 
the external or simulation indices CV and NMBE. The com- 
peting models can then be compared, and a selection made, 
based on both the internal and external predictive indices. 
The simulation indices will generally be poorer than the in- 
ternal predictive indices; larger discrepancies are suggestive 
of greater over-fitting, and vice versa. This method of mo- 
del evaluation which can avoid model over-fitting is referred 
to as holdout sample cross-validation or simply cross-vali- 
dation. Note, however, that though the same equations are 
used to compute the CV and NMBE indices, the degrees of 
freedom (df) are different. While df=n-k for computing the 
internal predictive errors where n is the number of obser- 
vations used for model building, df=m for computing the 
external indices where m is the number of observations in 
the cross-validation set. 
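The holdout procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than a prescribed implementation: the data are synthetic (a straight line plus Gaussian noise, generated with a fixed seed), the helper names `ols_line` and `cv_nmbe` are hypothetical, and a simple linear model with k = 2 parameters is assumed.

```python
import math
import random


def ols_line(x, y):
    """Ordinary least squares intercept and slope for y = a + b*x."""
    n, sx, sy = len(x), sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - b * sx) / n, b


def cv_nmbe(x, y, a, b, df):
    """CV and NMBE of the line (a, b) on the data (x, y) for the given df."""
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    y_mean = sum(y) / len(y)
    rmse = math.sqrt(sum(r * r for r in resid) / df)
    return rmse / y_mean, (sum(resid) / df) / y_mean


random.seed(1)  # fixed seed so the synthetic example is repeatable
x = [float(i) for i in range(1, 51)]
y = [5.0 + 0.9 * xi + random.gauss(0.0, 3.0) for xi in x]

# Random 80/20 partition of the observation indices
order = list(range(len(x)))
random.shuffle(order)
n_train = int(0.8 * len(x))
train, test = order[:n_train], order[n_train:]
x_tr, y_tr = [x[i] for i in train], [y[i] for i in train]
x_te, y_te = [x[i] for i in test], [y[i] for i in test]

a, b = ols_line(x_tr, y_tr)                                  # model from the 80% portion
cv_int, nmbe_int = cv_nmbe(x_tr, y_tr, a, b, df=n_train - 2)  # internal: df = n - k
cv_ext, nmbe_ext = cv_nmbe(x_te, y_te, a, b, df=len(x_te))    # external: df = m
print(cv_int, nmbe_int, cv_ext, nmbe_ext)
```

The internal NMBE is zero to within round-off, as expected for OLS, while the external NMBE generally is not. Comparing the internal and external indices is what flags over-fitting.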

(e) The mean absolute deviation (MAD) is defined as the mean absolute difference between the actual data values and model predicted values:

$$MAD = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n - k} \tag{5.10}$$



Example 5.3.2: Using the data from Example 5.3.1, repeat the exercise using your spreadsheet program. Calculate the $R^2$, RMSE and CV values.

From Eq. 5.2, SSE = 323.3 and SSR = 3713.88. From this, SST = SSE + SSR = 4037.2.

Then from Eq. 5.7a, $R^2$ = 92.0%, while from Eq. 5.8a, RMSE = 3.2295, from which CV = 0.095 = 9.5%. ■
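The arithmetic of this example can be reproduced from the quoted sums of squares. The sketch below assumes n = 33 and k = 2 from Example 5.3.1 and takes the mean of y as 1124/33. The adjusted R² line goes beyond what the example reports and uses the usual (n - 1)/(n - k) correction.

```python
import math

n, k = 33, 2                 # observations and model parameters (Example 5.3.1)
sse, ssr = 323.3, 3713.88    # sums of squares quoted in the example
y_mean = 1124 / n            # mean of the response variable

sst = sse + ssr                                # total variation
r2 = ssr / sst                                 # coefficient of determination
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)      # adjusted R^2 (not reported in the example)
rmse = math.sqrt(sse / (n - k))                # root mean square error
cv = rmse / y_mean                             # coefficient of variation of the RMSE

print(round(sst, 1), round(r2, 3), round(rmse, 4), round(cv, 3))
```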



5.3.3 Inferences on Regression Coefficients 
and Model Significance 

Even after the overall regression model is found, one must guard against the possibility that there may not be a significant relationship between the response and the regressor variables, in which case the entire identification process becomes suspect. The F-statistic, which tests for significance of the overall regression model, is defined as:

$$F = \frac{\text{variance explained by the regression}}{\text{variance not explained by the regression}} = \frac{SSR}{SSE} \cdot \frac{n - k}{k - 1} \tag{5.11}$$

Thus, the smaller the value of F, the poorer the regression model. It will be noted that the F-statistic is directly related to $R^2$ as follows:

$$F = \frac{R^2}{(1 - R^2)} \cdot \frac{n - k}{k - 1} \tag{5.12}$$

Hence, the F-statistic can alternatively be viewed as being a measure to test the significance of $R^2$ itself. In the case of univariate regression, the F-test is really the same as a t-test for the significance of the slope coefficient. In the general case, the F-test allows one to test the joint hypothesis of whether all coefficients of the regressor variables are equal to zero or not.

Example 5.3.3: Calculate the F-statistic for the model identified in Example 5.3.1. What can you conclude about the significance of the fitted model? From Eq. 5.11,

$$F = \left( \frac{3713.88}{323.3} \right) \cdot \frac{33 - 2}{2 - 1} = 356$$

which clearly indicates that the overall regression fit is significant. The reader can verify that Eq. 5.12 also yields an identical value of F. ■

Note that the values of coefficients a and b based on the given sample of n observations are only estimates of the true model parameters α and β. If the experiment is repeated over and over again, the estimates a and b are likely to vary from one set of experimental observations to another. OLS estimation assumes that the model residual ε is a random variable with zero mean. Further, it is assumed that the residuals $\varepsilon_i$ at specific values of x are randomly distributed, which is akin to saying that the distributions shown in Fig. 5.3 at specific values of x are normal and have equal variance.

After getting an overall picture of the regression model, it is useful to study the significance of each individual regressor on the overall statistical fit in the presence of all other regressors. The Student t-statistic is widely used for this purpose and is applied to each regression parameter:

For the slope parameter:

$$t = \frac{b - \beta}{s_b} \tag{5.13a}$$

where the estimated standard deviation of parameter "b" is $s_b = RMSE / \sqrt{S_{xx}}$.

For the intercept parameter:

$$t = \frac{a - \alpha}{s_a} \tag{5.13b}$$

where the estimated standard deviation of parameter "a" is

$$s_a = RMSE \cdot \left[ \frac{\sum_{i=1}^{n} x_i^2}{n \cdot S_{xx}} \right]^{1/2}$$

where b and a are the estimated slope and intercept coefficients, β and α the hypothesized true values, and RMSE is given by Eq. 5.8a. Estimated standard deviations of the coefficients b and a, given by Eqs. 5.13a and b, are usually referred to as standard errors of the coefficients. Basically, the t-test as applied to regression model building is a formal statistical test to determine how significantly different an individual coefficient is from zero in the presence of the remaining coefficients. Stated simply, it enables an answer to the following question: would the fit become poorer if the regressor variable in question is not used in the model at all?
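The F- and t-statistics above reduce to a few arithmetic operations once the summary quantities are known. The sketch below uses the values of the running example (n = 33, SSE = 323.3, SSR = 3713.88, RMSE = 3.2295, S_xx = 4152.18 and a sum of squared x values of 41,086). The hypothesized values β = 1 and α = 0 are chosen here purely for illustration.

```python
import math

n, k = 33, 2
sse, ssr = 323.3, 3713.88
rmse, s_xx, sum_x2 = 3.2295, 4152.18, 41086
a, b = 3.8296, 0.9036        # estimated intercept and slope

# Overall significance: Eq. 5.11 and its R^2 form, Eq. 5.12
f_from_ss = (ssr / sse) * (n - k) / (k - 1)
r2 = ssr / (ssr + sse)
f_from_r2 = r2 / (1 - r2) * (n - k) / (k - 1)

# Standard errors of the coefficients
s_b = rmse / math.sqrt(s_xx)
s_a = rmse * math.sqrt(sum_x2 / (n * s_xx))

# t-statistics against the hypothesized values (assumed here: beta = 1, alpha = 0)
t_slope = (b - 1.0) / s_b
t_intercept = (a - 0.0) / s_a

print(round(f_from_ss, 1), round(t_slope, 2), round(t_intercept, 2))
```

Both forms of the F-statistic agree to round-off, which is the identity stated in the text. The t-values are then compared against tabulated critical values for n - 2 degrees of freedom.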

The confidence intervals, assuming the model residuals to be normally distributed, are given by:

For the slope:

$$b - \frac{t_{\alpha/2} \cdot RMSE}{\sqrt{S_{xx}}} < \beta < b + \frac{t_{\alpha/2} \cdot RMSE}{\sqrt{S_{xx}}} \tag{5.14a}$$

For the intercept:

$$a - \frac{t_{\alpha/2} \cdot RMSE \cdot \sqrt{\sum_{i=1}^{n} x_i^2}}{\sqrt{n \cdot S_{xx}}} < \alpha < a + \frac{t_{\alpha/2} \cdot RMSE \cdot \sqrt{\sum_{i=1}^{n} x_i^2}}{\sqrt{n \cdot S_{xx}}} \tag{5.14b}$$

where $t_{\alpha/2}$ is the value of the t distribution with df = (n - 2) and $S_{xx}$ is defined by Eq. 5.6.

Example 5.3.4: In Example 5.3.1, the estimated value of b = 0.9036. Test the hypothesis that β = 1.0 as against the alternative that β < 1.0.

$H_0$: β = 1.0
$H_1$: β < 1.0

From Eq. 5.6a, $S_{xx}$ = 4152.18. Using Eq. 5.13a, with RMSE = 3.2295:

$$t = \frac{0.9036 - 1.0}{3.2295 / \sqrt{4152.18}} = -1.92$$

with n - 2 = 31 degrees of freedom.

From Table A.4, the one-sided critical t-value for 95% CL = 1.697. Since the magnitude of the computed t-value is greater than the critical value, one can reject the null hypothesis and conclude that there is strong evidence to support β < 1 at the 95% confidence level. ■

Example 5.3.5: In Example 5.3.1, the estimated value of a = 3.8296. Test the hypothesis that α = 0 as against the alternative that α ≠ 0 at the 95% confidence level.

$H_0$: α = 0
$H_1$: α ≠ 0

Using Eq. 5.13b,

$$t = \frac{3.8296 - 0}{3.2295 \cdot \sqrt{41{,}086 / [(33)(4152.18)]}} = 2.17$$

with n - 2 = 31 degrees of freedom.

Again, one can reject the null hypothesis, and conclude that α ≠ 0 at 95% CL. ■

Example 5.3.6: Find the 95% confidence interval for the slope term of the linear model identified in Example 5.3.1.

Assuming a two-tailed test, $t_{0.05/2}$ = 2.045 for 31 degrees of freedom. Therefore, the 95% confidence interval for β given by Eq. 5.14a is:

$$0.9036 - \frac{(2.045)(3.2295)}{(4152.18)^{1/2}} < \beta < 0.9036 + \frac{(2.045)(3.2295)}{(4152.18)^{1/2}}$$

or

$$0.8011 < \beta < 1.0061$$

Example 5.3.7: Find the 95% confidence interval for the intercept term of the linear model identified in Example 5.3.1.

Again, assuming a two-tailed test, and using Eq. 5.14b, the 95% confidence interval for α is:

$$3.8296 - \frac{(2.045)(3.2295)\sqrt{41{,}086}}{[(33)(4152.18)]^{1/2}} < \alpha < 3.8296 + \frac{(2.045)(3.2295)\sqrt{41{,}086}}{[(33)(4152.18)]^{1/2}}$$

or

$$0.2131 < \alpha < 7.4461$$



5.3.4 Model Prediction Uncertainty



A regression equation can be used to predict future values of y provided the x value is within the domain of the original data from which the model was identified. One differentiates between two types of predictions (similar to the confidence limits of the mean treated in Sect. 4.2.1.b):

(a) mean response or standard error of regression, where one would like to predict the mean value of y for a large number of repeated $x_0$ values. The mean value is directly deduced from the regression equation while the variance is:

$$\sigma^2(\hat{y}_0) = MSE \cdot \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right] \tag{5.15}$$

Note that the first term within the brackets, namely (MSE/n), is the standard error of the mean (see Eq. 4.2) while the other term is a result of the standard error of the slope coefficient. The latter has the effect of widening the uncertainty bands at either end of the range of variation of x.

(b) individual or specific response or standard error of prediction, where one would like to predict the specific value of y for a specific value $x_0$. The variance of this prediction is larger than that of the mean response by an amount equal to the MSE. Thus,

$$\sigma^2(\hat{y}_0) = MSE \cdot \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right] \tag{5.16}$$

Finally, the 95% CL for the individual response at level $x_0$ is:

$$y_0 = \hat{y}_0 \pm t_{0.05/2} \cdot \sigma(\hat{y}_0) \tag{5.17}$$

where $t_{0.05/2}$ is the value of the Student t distribution at a significance level of 0.05 for a two-tailed error distribution. It is obvious that the prediction intervals for individual responses are wider than those of the mean response, called confidence levels (see Fig. 5.6). Note that Eqs. 5.16 and 5.17 strictly apply when the errors are normally distributed.

Some texts state that the data set should be at least five to eight times larger than the number of model parameters to be identified. In case of short data sets, OLS may not yield robust estimates of model uncertainty and resampling methods are advocated (see Sect. 10.6.2).

Example 5.3.8: Calculate the 95% confidence limits (CL) for predicting the mean response for x = 20.

First, the regression model is used to calculate $\hat{y}_0$ at $x_0$ = 20:

$$\hat{y}_0 = 3.8296 + (0.9036)(20) = 21.9025$$

Using Eq. 5.15,

$$\sigma(\hat{y}_0) = (3.2295) \left[ \frac{1}{33} + \frac{(20 - 33.4545)^2}{4152.18} \right]^{1/2} = 0.87793$$

Further, from Table A.4, $t_{0.05/2}$ = 2.04 for d.f. = 33 - 2 = 31.

Using Eq. 5.15 yields the confidence interval for the mean response:

$$21.9025 - (2.04)(0.87793) < \mu(\hat{y}_{20}) < 21.9025 + (2.04)(0.87793)$$

or,

$$20.112 < \mu(\hat{y}_{20}) < 23.693 \text{ at 95\% CL.}$$

Example 5.3.9: Calculate the 95% prediction limits (PL) for predicting the individual response for x = 20.

Using Eq. 5.16,

$$\sigma(\hat{y}_0) = (3.2295) \left[ 1 + \frac{1}{33} + \frac{(20 - 33.4545)^2}{4152.18} \right]^{1/2} = 3.3467$$

Further, $t_{0.05/2}$ = 2.04. Using Eq. 5.17 yields

$$21.9025 - (2.04)(3.3467) < \hat{y}_{20} < 21.9025 + (2.04)(3.3467)$$

or

$$15.075 < \hat{y}_{20} < 28.730.$$
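The two kinds of interval in Examples 5.3.8 and 5.3.9 differ only by the extra unity term inside the brackets. A minimal sketch follows, with the hypothetical function name `prediction_limits`. It assumes the summary values of Example 5.3.1 and the tabulated t-value of 2.04 quoted in the examples, rather than a computed quantile.

```python
import math

n = 33
a, b = 3.8296, 0.903643      # intercept and (unrounded) slope from Example 5.3.1
rmse, s_xx = 3.2295, 4152.18
x_mean = 1104 / n
t_crit = 2.04                # two-tailed 95% value quoted from Table A.4


def prediction_limits(x0):
    """Return (y0, confidence limits of the mean, prediction limits) at x0."""
    y0 = a + b * x0
    h = 1.0 / n + (x0 - x_mean) ** 2 / s_xx
    se_mean = rmse * math.sqrt(h)          # Eq. 5.15: mean response
    se_ind = rmse * math.sqrt(1.0 + h)     # Eq. 5.16: individual response
    cl = (y0 - t_crit * se_mean, y0 + t_crit * se_mean)
    pl = (y0 - t_crit * se_ind, y0 + t_crit * se_ind)
    return y0, cl, pl


y0, cl, pl = prediction_limits(20.0)
print(round(y0, 4), [round(v, 3) for v in cl], [round(v, 3) for v in pl])
```

Evaluating the function over a grid of x0 values traces out bands of the kind shown in Fig. 5.6, which widen away from the mean of x.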



5.4 Multiple OLS Regression 



Regression models can be classified as:

(i) single variate or multivariate, depending on whether only one or several regressor variables are being considered;

(ii) single equation or multi-equation, depending on whether only one or several response variables are being considered; and

(iii) linear or non-linear, depending on whether the model is linear or non-linear in its function. Note that the distinction is with respect to the parameters (and not its variables). Thus, a regression equation such as $y = a + b \cdot x + c \cdot x^2$ is said to be linear in its parameters {a, b, c} though it is non-linear in the regressor variable x (see Sect. 1.2.3 for a discussion on classification of mathematical models).

Fig. 5.6 95% confidence intervals and 95% prediction intervals about the regression line

Certain simple single variate equation models are shown 
in Fig. 5.7. Frame (a) depicts simple linear models (one with 
a positive slope and another with a negative slope), while (b) 
and (c) are higher order polynomial models which, though 
non-linear in the function, are models linear in their parame- 
ters. The other figures depict non-linear models. Because of 
the relative ease in linear model building, data analysts often 
formulate a linear model even if the relationship of the data 
is not strictly linear. If a function such as that shown in frame 
(d) is globally non-linear, and if the domain of the experi- 
ment is limited say to the right knee of the curve (bounded by 



points c and d), then a linear function in this region could be 
postulated. Models tend to be preferentially framed as linear 
ones largely due to the simplicity in the subsequent analysis 
and the prevalence of solution methods based on matrix al- 
gebra. 



5.4.1 Higher Order Linear Models: Polynomial, 
Multivariate 

Fig. 5.7 General shape of regression curves, with panels including $y = a_0 + a_1 \log x$, $\log y = a_0 + a_1 \log x$, $y = a_0 + a_1 x + a_2 x^2$ and $y = a_0 + a_1 x + a_2 x^2 + a_3 x^3$. (From Shannon 1975 by © permission of Pearson Education)

When more than one regressor variable is known to influence the response variable, a multivariate model will explain more of the variation and provide better predictions than a single variate model. The parameters of such a model need to be identified using multiple regression techniques. This section will discuss certain important issues regarding multivariate, single-equation models linear in the parameters. For now, the treatment is limited to regressors which are uncorrelated or independent. Consider a data set of n readings that include k regressor variables. The corresponding form, called the additive multiple linear regression model, is:



$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \tag{5.18a}$$

where ε is the error or unexplained variation in y. Note the lack of any interaction terms, and hence the term "additive". The simple interpretation of the model parameters is that $\beta_j$ measures the unit influence of $x_j$ on y (i.e., denotes the slope $dy/dx_j$). Note that this is strictly true only when the variables are really independent or uncorrelated, which, often, they are not.

The same model formulation is equally valid for a k-th degree polynomial regression model, which is a special case of Eq. 5.18a with $x_1 = x$, $x_2 = x^2$, ...:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \varepsilon \tag{5.19}$$

Let $x_{ij}$ denote the i-th observation of parameter j. Then Eq. 5.18a can be re-written as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i \tag{5.18b}$$

Often, it is most convenient to consider the "normal" transformation where the regressor variables are expressed as a difference from the mean (the reason why this form is important will be discussed in Sect. 6.3 while dealing with experimental design methods). Specifically, Eq. 5.18a transforms into

$$y = \beta_0' + \beta_1 (x_1 - \bar{x}_1) + \beta_2 (x_2 - \bar{x}_2) + \cdots + \beta_k (x_k - \bar{x}_k) + \varepsilon \tag{5.18c}$$

An important special case is the quadratic regression model when k = 2. The straight line is now replaced by parabolic curves depending on the value of β (i.e., either positive or negative). Multivariate model development utilizes some of the same techniques as discussed in the single variable case. The first step is to identify all variables that can influence the response as predictor variables. It is the analyst's responsibility to identify these potential predictor variables based on his or her knowledge of the physical system. It is then possible to plot the response against all possible predictor variables in an effort to identify any obvious trends. The greatest single disadvantage to this approach is the sheer labor involved when the number of possible predictor variables is high.

A situation that arises in multivariate regression is the concept of variable synergy, commonly called interaction between variables (this is a consideration in other problems; for example, when dealing with design of experiments). This occurs when two or more variables interact and impact system response to a degree greater than when the variables operate independently. In such a case, the first-order linear model with two interacting regressor variables takes the form:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 \cdot x_2 + \varepsilon \tag{5.20}$$

How the interaction parameter affects the shape of the family of curves is illustrated in Fig. 5.8. The origin of this model function is easy to derive. In the non-interacting case, the lines for different values of regressor $x_2$ are essentially parallel, and so the slope terms for both models are equal. Let the model with the first regressor be $y = a' + b x_1$, while the intercept is given by $a' = f(x_2) = a + c x_2$. Combining both equations results in $y = a + b x_1 + c x_2$. This corresponds to Fig. 5.8a. For the interaction case, both the slope and the intercept terms are functions of $x_1$. Hence, representing $a' = a + b x_1$ and $b' = c + d x_1$, then:

$$y = a + b x_1 + (c + d x_1) x_2 = a + b x_1 + c x_2 + d x_1 x_2$$

which is identical in structure to Eq. 5.20.

Fig. 5.8 Plots illustrating the effect of interaction among the regressor variables. a Non-interacting, b Interacting

Simple linear functions have been assumed above. It is straightforward to derive expressions for higher order models by analogy. For example, the second-order (or quadratic) model without interacting variables is:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \varepsilon \tag{5.21}$$

For a second order model with interacting terms, the corresponding expression can be derived as follows:

Consider the linear polynomial model with one regressor:

$$y = b_0 + b_1 x_1 + b_2 x_1^2 \tag{5.22}$$

If the parameters {$b_0$, $b_1$, $b_2$} can themselves be expressed as second-order polynomials of another regressor $x_2$, the full model which has nine regression parameters is:

$$y = b_{00} + b_{10} x_1 + b_{01} x_2 + b_{11} x_1 x_2 + b_{20} x_1^2 + b_{02} x_2^2 + b_{21} x_1^2 x_2 + b_{12} x_1 x_2^2 + b_{22} x_1^2 x_2^2 \tag{5.23}$$



The most general additive model, which imposes little structure on the relationship, is given by:

$$y = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_k(x_k) + \varepsilon \tag{5.24}$$

where the forms of the functions $f_j(x_j)$ are unspecified.

Note that synergistic behavior can result in two or more 
variables working together to "overpower" another variab- 
le's prediction capability. As a result, it is necessary to al- 
ways check the importance (the relative value of either the 
t- or F-values) of each individual predictor variable while 
performing multivariate regression. Those variables with t- 
or F-values that are insignificant should be omitted from the 
model and the remaining predictors used to estimate the mo- 
del parameters. The stepwise regression method described in 
Sect. 5.7.4 is based on this approach. 
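The model forms of this section (Eqs. 5.20, 5.21 and 5.23) differ only in which terms of the two regressors enter the model. Since each form is linear in its parameters, each can be handled by building the corresponding row of regressor terms. The sketch below is illustrative only: the function names and the sample coefficient values are hypothetical.

```python
def first_order_interaction(x1, x2):
    """Terms of Eq. 5.20: constant, x1, x2 and the interaction x1*x2."""
    return [1.0, x1, x2, x1 * x2]


def second_order_no_interaction(x1, x2):
    """Terms of Eq. 5.21: constant, x1, x2, x1^2 and x2^2."""
    return [1.0, x1, x2, x1 ** 2, x2 ** 2]


def full_second_order(x1, x2):
    """Nine terms x1^i * x2^j of Eq. 5.23, for i, j = 0, 1, 2 (i varies slowest)."""
    return [x1 ** i * x2 ** j for i in range(3) for j in range(3)]


def predict(terms, coeffs):
    """Model response: sum of coefficient times regressor term."""
    return sum(c * t for c, t in zip(coeffs, terms))


# Hypothetical coefficients for the interaction model, for illustration only
row = first_order_interaction(2.0, 3.0)
y_hat = predict(row, [1.0, 0.5, 0.25, 0.1])
print(row, y_hat, len(full_second_order(2.0, 3.0)))
```

Stacking such rows for all observations gives the regressor matrix used in multiple regression. The number of columns equals the number of parameters to be identified.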



5.4.2 Matrix Formulation 

When dealing with multiple regression, it is advantageous 
to resort to matrix algebra because of the compactness and 
ease of manipulation it offers without loss in clarity. Though 
the solution is conveniently provided by a computer, a basic 
understanding of matrix formulation is nonetheless useful. 
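In matrix notation (with y' denoting the transpose of y), the linear model given by Eq. 5.18 can be expressed as follows (with the matrix dimension shown in subscripted brackets):

$$Y_{(n,1)} = X_{(n,p)} \, \beta_{(p,1)} + \varepsilon_{(n,1)} \tag{5.25}$$

where p is the number of parameters in the model = k + 1 (for a linear model), n is the number of observations, and

$$Y' = [y_1 \; y_2 \; \ldots \; y_n], \quad \beta' = [\beta_0 \; \beta_1 \; \ldots \; \beta_k], \quad \varepsilon' = [\varepsilon_1 \; \varepsilon_2 \; \ldots \; \varepsilon_n] \tag{5.26a}$$

$$X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix} \tag{5.26b}$$

The descriptive measures applicable for a single variable can be extended to multivariables of order p (= k + 1), and written in compact matrix notation.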



5.4.3 OLS Parameter Identification



The approach involving minimization of SSE for the univariate case (Sect. 5.3.1) can be generalized to multivariate linear regression. Here, the parameter set β is to be identified such that the sum of squares function L is minimized:

$$L = \sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon'\varepsilon = (Y - X\beta)'(Y - X\beta) \tag{5.27}$$

or,

$$\frac{\partial L}{\partial \beta} = -2X'Y + 2X'X\beta = 0 \tag{5.28}$$

which leads to the system of normal equations

$$X'Xb = X'Y. \tag{5.29}$$

From here,

$$b = (X'X)^{-1} X'Y \tag{5.30}$$

provided the matrix X'X is not singular, and where b is the least square estimator matrix of β.

Note that X'X is called the variance-covariance matrix 
of the estimated regression coefficients. It is a symmetrical 
matrix with the main diagonal elements being the sum of 
squares of the elements in the columns of X (i.e., the vari- 
ances) and the off-diagonal elements being the sum of the 
cross-products (i.e., the covariances). Specifically, 



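$$X'X = \begin{bmatrix} n & \sum_{i=1}^{n} x_{i1} & \cdots & \sum_{i=1}^{n} x_{ik} \\ \sum_{i=1}^{n} x_{i1} & \sum_{i=1}^{n} x_{i1}^2 & \cdots & \sum_{i=1}^{n} x_{i1} \cdot x_{ik} \\ \vdots & \vdots & & \vdots \\ \sum_{i=1}^{n} x_{ik} & \sum_{i=1}^{n} x_{ik} \cdot x_{i1} & \cdots & \sum_{i=1}^{n} x_{ik}^2 \end{bmatrix} \tag{5.31}$$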
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5 Estimation of Linear Model Parameters Using Least Squares 



Under OLS regression, b is an unbiased estimator of β with the variance-covariance matrix var(b) given by:

$$var(b) = \sigma^2 (X'X)^{-1} \tag{5.32}$$

where $\sigma^2$ is the mean square error of the model error terms

$$\sigma^2 = \frac{\text{sum of square errors}}{n - p} \tag{5.33}$$

An unbiased estimator of $\sigma^2$ is $s^2$, where

$$s^2 = \frac{\varepsilon'\varepsilon}{n - p} = \frac{Y'Y - b'X'Y}{n - p} = \frac{SSE}{n - p} \tag{5.34}$$

For predictions within the range of variation of the original data, the mean and individual response values are normally distributed with the variance given by the following:

(a) For the mean response at a specific set of $x_0$ values, called the confidence level, under OLS

$$var(\hat{y}_0) = s^2 \left[ X_0 (X'X)^{-1} X_0' \right] \tag{5.35}$$

(b) The variance of an individual prediction, called the prediction level, is

$$var(\hat{y}_0) = s^2 \left[ 1 + X_0 (X'X)^{-1} X_0' \right] \tag{5.36}$$

where 1 is a column vector of unity.

Confidence limits at a significance level α are:

$$\hat{y}_0 \pm t(n - k, \alpha/2) \cdot var^{1/2}(\hat{y}_0) \tag{5.37}$$

Example 5.4.1: Part load performance of fans (and pumps)
Part-load performance curves do not follow the idealized fan laws due to various irreversible losses. For example, decreasing the flow rate by half of the rated flow does not result in a 1/8th decrease in its rated power consumption. Hence, actual tests are performed for such equipment under different levels of loading. The performance tests of the flow rate and the power consumed are then normalized by the rated or 100% load conditions, called part load ratio (PLR) and fractional full-load power (FFLP) respectively. Polynomial models can then be fit between these two quantities. Data assembled in Table 5.2 were obtained from laboratory tests on a variable speed drive (VSD) control, which is a very energy efficient control option.

(a) What is the matrix X in this case if a second order polynomial model is to be identified of the form $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2$?

Table 5.2 Data table for Example 5.4.1

PLR    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
FFLP   0.05   0.11   0.19   0.28   0.39   0.51   0.68   0.84   1.00

(b) Using the data given in the table, identify the model and report relevant statistics on both parameters and overall model fit.

(c) Compute the confidence bands and the prediction bands at 0.05 significance level for the response at values of PLR = 0.2 and 1.00 (i.e., the extreme points).

Solution 

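(a) The independent variable matrix X given by Eq. 5.26b, with columns corresponding to the constant term, PLR and PLR^2, is:

$$X = \begin{bmatrix} 1 & 0.2 & 0.04 \\ 1 & 0.3 & 0.09 \\ 1 & 0.4 & 0.16 \\ 1 & 0.5 & 0.25 \\ 1 & 0.6 & 0.36 \\ 1 & 0.7 & 0.49 \\ 1 & 0.8 & 0.64 \\ 1 & 0.9 & 0.81 \\ 1 & 1.0 & 1.00 \end{bmatrix}$$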



(b) The results of the regression are shown below:

Parameter    Estimate      Standard error   t-statistic   P-value
CONSTANT     -0.0204762    0.0173104        -1.18288      0.2816
PLR           0.179221     0.0643413         2.78547      0.0318
PLR^2         0.850649     0.0526868        16.1454       0.0000

Analysis of Variance

Source          Sum of squares   Df   Mean square     F-ratio   P-value
Model           0.886287         2    0.443144        5183.10   0.0000
Residual        0.000512987      6    0.0000854978
Total (Corr.)   0.8868           8

Goodness-of-fit: $R^2$ = 99.9%, Adjusted $R^2$ = 99.9%, RMSE = 0.009246

Mean absolute error (MAD) = 0.00584. The equation of the fitted model is (with appropriate rounding)

FFLP = -0.0205 + 0.1792*PLR + 0.8506*PLR^2
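The fitted coefficients can be reproduced by solving the normal equations X'Xb = X'Y directly. The following is a minimal sketch using only the standard library; the helper `solve` (Gaussian elimination with partial pivoting) is written here for illustration and is not the routine used by the software package that produced the table.

```python
import math


def solve(a, rhs):
    """Solve the square system a*x = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]   # work on a copy
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][j] * x[j] for j in range(r + 1, n))) / aug[r][r]
    return x


plr = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
fflp = [0.05, 0.11, 0.19, 0.28, 0.39, 0.51, 0.68, 0.84, 1.00]

rows = [[1.0, x, x * x] for x in plr]        # regressor matrix X: 1, PLR, PLR^2
p = len(rows[0])
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * y for r, y in zip(rows, fflp)) for i in range(p)]

b = solve(xtx, xty)                          # least squares estimates
resid = [y - sum(bj * rj for bj, rj in zip(b, r)) for r, y in zip(rows, fflp)]
sse = sum(e * e for e in resid)
rmse = math.sqrt(sse / (len(plr) - p))
print([round(v, 6) for v in b], round(rmse, 6))
```

For a three-parameter model on well-scaled data this direct solution is adequate. With many or poorly scaled regressors, a numerically more robust factorization from a linear algebra library would be the safer choice.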

Since the P-value in the ANOVA table is less than 0.05, the- 
re is a statistically significant relationship between FFLP and 
PLR at the 95% confidence level. However, the p-value of the 
constant term is large, and a model without an intercept term 
is probably more appropriate; thus, such an analysis ought to 
be performed, and its results evaluated. The values shown are 
those provided by the software package. There are too many 
significant decimals, and so the analyst should round these off 
appropriately while reporting the results (as shown above). 

(c) The 95% confidence and the prediction intervals are shown in Fig. 5.9. Because the fit is excellent, these are very narrow and close to each other. The predicted values as well as the 95% CL and PL for the two data points are given in the table below. Note that the uncertainty range is relatively much larger at the lower value than at the higher range.

Fig. 5.9 Plot of fitted model with 95% CL and 95% PL bands

        Predicted    95% Prediction Limits      95% Confidence Limits
x       y            Lower        Upper         Lower        Upper
0.2     0.0493939    0.0202378    0.0785501     0.0310045    0.0677834
1.0     1.00939      0.980238     1.03855       0.991005     1.02778
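The limits in the table follow from Eqs. 5.35 to 5.37 applied to the fitted quadratic model. The sketch below recomputes them under two assumptions that are not stated in the software output: the degrees of freedom are taken as n - p = 6, and the corresponding two-tailed t-value is taken as 2.447. The helper names `solve` and `bands` are hypothetical.

```python
import math


def solve(a, rhs):
    """Gaussian elimination with partial pivoting (inputs are not modified)."""
    n = len(a)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][j] * x[j] for j in range(r + 1, n))) / aug[r][r]
    return x


plr = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
fflp = [0.05, 0.11, 0.19, 0.28, 0.39, 0.51, 0.68, 0.84, 1.00]
rows = [[1.0, x, x * x] for x in plr]
n, p = len(rows), 3
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * y for r, y in zip(rows, fflp)) for i in range(p)]
b = solve(xtx, xty)
sse = sum((y - sum(bj * rj for bj, rj in zip(b, r))) ** 2 for r, y in zip(rows, fflp))
s = math.sqrt(sse / (n - p))     # square root of s^2 in Eq. 5.34
t_crit = 2.447                   # assumed: two-tailed 0.05 value for 6 degrees of freedom


def bands(x0):
    """Predicted value, 95% confidence limits and 95% prediction limits at x0."""
    row0 = [1.0, x0, x0 * x0]
    y0 = sum(bj * rj for bj, rj in zip(b, row0))
    # Quadratic form X0 (X'X)^-1 X0' obtained by solving (X'X) v = X0'
    h = sum(r * v for r, v in zip(row0, solve(xtx, row0)))
    half_cl = t_crit * s * math.sqrt(h)          # Eq. 5.35
    half_pl = t_crit * s * math.sqrt(1.0 + h)    # Eq. 5.36
    return y0, (y0 - half_cl, y0 + half_cl), (y0 - half_pl, y0 + half_pl)


for x0 in (0.2, 1.0):
    print(x0, bands(x0))
```

Solving a linear system for the quadratic form avoids forming the inverse of X'X explicitly, which is a common numerical precaution.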



Example 5.4.2: Table 5.3 gives the solubility of oxygen in 
water in (mg/L) at 1 atm pressure for different temperatures 
and different chloride concentrations in (mg/L). 

(a) Plot the data and formulate two different models to be 
evaluated 

(b) Evaluate both models and identify the better one. Give 
justification for your choice 

(c) Report pertinent statistics for model parameters as well 
as overall model fit 

(a) The above data is plotted in Fig. 5.10a. One notes that 
the series of plots are slightly non-linear but parallel 
suggesting a higher order model without interaction 
terms. Hence, first order and second order polynomial 
models without interaction are logical models to inves- 
tigate. 



Table 5.3 Solubility of oxygen in water (mg/L) with temperature and chloride concentration

Temperature (°C)    Chloride concentration in water (mg/L)
                    0        5,000    10,000   15,000
0                   14.62    13.73    12.89    12.10
5                   12.77    12.02    11.32    10.66
10                  11.29    10.66    10.06    9.49
15                  10.08    9.54     9.03     8.54
20                  9.09     8.62     8.17     7.75
25                  8.26     7.85     7.46     7.08
30                  7.56     7.19     6.85     6.51



Fig. 5.10 a Plot of data (solubility versus temperature for each chloride concentration), b Residual pattern for the first order model, c Residual pattern for the second order model (residuals in b and c plotted against predicted solubility)



(b1) Analysis results of the first order model without interaction term:

$R^2$ = 96.83%, Adjusted $R^2$ = 96.57%, RMSE = 0.41318

Parameter                Estimate        Standard error   t-statistic   P-value
CONSTANT                 13.6111         0.175471          77.5686      0.0000
Chloride concentration   -0.000109857    0.000013968       -7.86489     0.0000
Temperature              -0.206786       0.00780837       -26.4826      0.0000

Analysis of Variance

Source          Sum of squares   Df   Mean square   F-ratio   P-value
Model           130.289          2    65.1445       381.59    0.0000
Residual        4.26795          25   0.170718
Total (Corr.)   134.557          27



The equation of the fitted model is:

Solubility = 13.6111 - 0.000109857 * Chloride Concentration
- 0.206786 * Temperature

The model has an excellent $R^2$ with all coefficients being
statistically significant, but the model residuals are very
ill-behaved since a distinct pattern can be seen (Fig. 5.10b). This
issue of how model residuals can provide diagnostic insights
into model building will be explored in detail in Sect. 5.6.

(b2) Analysis results for the second order model without
interaction term:

The OLS regression results in $R^2$ = 99.26%, Adjusted
$R^2$ = 99.13%, RMSE = 0.20864, Mean absolute error = 0.14367.
This model is distinctly better, with higher $R^2$
and lower RMSE. Except for one term (the square of the
concentration), all parameters are statistically significant.
The residual pattern is less distinct, but the residuals are still
patterned (Fig. 5.10c). It would be advisable to investigate
other functional forms, probably non-linear or based on
some mechanistic insights.



Parameter                  Estimate        Standard error   t-statistic   P-value
CONSTANT                   14.1183         0.112448         125.554       0.0000
Temperature                -0.325          0.0142164        -22.8609      0.0000
Chloride concentration     -0.000118643    0.0000246866     -4.80596      0.0001
Temperature^2              0.00394048      0.000455289      8.65489       0.0000
Chloride concentration^2   5.85714E-10     1.57717E-9       0.371371      0.7138

Analysis of Variance

Source          Sum of squares   Df   Mean square   F-ratio   P-value
Model           133.556           4   33.3889       767.02    0.0000
Residual        1.0012           23   0.0435305
Total (Corr.)   134.557          27
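The two fits above can be reproduced, at least approximately, with a few lines of code. The sketch below is not the software used to generate the tables; it assumes that the first row and first column of Table 5.3 correspond to 0 °C and 0 mg/L, and the helper name `fit_ols` is purely illustrative.

```python
import numpy as np

# Table 5.3: rows are temperature (deg C), columns are chloride concentration (mg/L).
# Assumption: the first row/column correspond to 0 deg C and 0 mg/L.
temperature = np.array([0, 5, 10, 15, 20, 25, 30], dtype=float)
chloride = np.array([0, 5000, 10000, 15000], dtype=float)
solubility = np.array([
    [14.62, 13.73, 12.89, 12.10],
    [12.77, 12.02, 11.32, 10.66],
    [11.29, 10.66, 10.06, 9.49],
    [10.08, 9.54, 9.03, 8.54],
    [9.09, 8.62, 8.17, 7.75],
    [8.26, 7.85, 7.46, 7.08],
    [7.56, 7.19, 6.85, 6.51],
])

# Flatten the 7 x 4 grid into 28 observations
T, C = np.meshgrid(temperature, chloride, indexing="ij")
T, C, y = T.ravel(), C.ravel(), solubility.ravel()


def fit_ols(X, y):
    """Illustrative helper: OLS coefficients, R^2, adjusted R^2 and RMSE."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    n, p = X.shape
    sse = float(resid @ resid)
    sst = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (sse / (n - p)) / (sst / (n - 1))
    rmse = np.sqrt(sse / (n - p))
    return coef, r2, adj_r2, rmse


ones = np.ones_like(y)
X1 = np.column_stack([ones, C, T])               # first order, no interaction
X2 = np.column_stack([ones, T, C, T**2, C**2])   # second order, no interaction

coef1, r2_1, adj_r2_1, rmse1 = fit_ols(X1, y)
coef2, r2_2, adj_r2_2, rmse2 = fit_ols(X2, y)

print(f"first order : R2 = {r2_1:.4f}, adj R2 = {adj_r2_1:.4f}, RMSE = {rmse1:.5f}")
print(f"second order: R2 = {r2_2:.4f}, adj R2 = {adj_r2_2:.4f}, RMSE = {rmse2:.5f}")
```

Plotting `y - X1 @ coef1` against `X1 @ coef1` should show the patterned residuals of Fig. 5.10b, which the summary statistics alone do not reveal.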









5.4.4 Partial Correlation Coefficients 

The simple correlation coefficient between two variables has
already been introduced (Sect. 3.4.2). Consider
the multivariate linear regression (MLR) model given by
Eq. 5.18. If the regressors are uncorrelated, then the simple
correlation coefficients provide a direct indication of the
influence of the individual regressors on the response variable.
Since regressors are often "somewhat" correlated, the
concept of the simple correlation coefficient can be modified to
handle such interactions. This leads to the concept of partial
correlation coefficients. Assume an MLR model with only
two regressors: $x_1$ and $x_2$. The procedure to compute the
partial correlation coefficient $r_{yx_1}$ between $y$ and $x_1$ will make
the concept clear:

Step 1: Regress $y$ vs $x_2$ so as to identify a prediction model
for $\hat{y}$
Step 2: Regress $x_1$ vs $x_2$ so as to identify a prediction model
for $\hat{x}_1$
Step 3: Compute new variables (in essence, the model residuals):
$y^* = y - \hat{y}$ and $x_1^* = x_1 - \hat{x}_1$
Step 4: The partial correlation $r_{yx_1}$ between $y$ and $x_1$ is the
simple correlation coefficient between $y^*$ and $x_1^*$

Note that the above procedure allows the linear influence
of $x_2$ to be removed from both $y$ and $x_1$, thereby enabling the
partial correlation coefficient to describe only the effect of
$x_1$ on $y$ which is not accounted for by the other variables in
the model. This concept plays a major role in the process of
stepwise model identification described in Sect. 5.7.4.
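The four steps above can be sketched in code as follows. The data are synthetic and purely hypothetical, and the function names are illustrative; the point is only to show that removing the linear influence of the second regressor from both variables and then correlating the residuals yields the partial correlation.

```python
import numpy as np


def residuals_after_regression(target, regressor):
    """Residuals of a simple OLS fit of target on regressor (with intercept)."""
    X = np.column_stack([np.ones_like(regressor), regressor])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ coef


def partial_correlation(y, x1, x2):
    """Partial correlation between y and x1 with the linear effect of x2 removed."""
    y_star = residuals_after_regression(y, x2)     # Steps 1 and 3
    x1_star = residuals_after_regression(x1, x2)   # Steps 2 and 3
    return np.corrcoef(y_star, x1_star)[0, 1]      # Step 4


# Hypothetical synthetic data with correlated regressors
rng = np.random.default_rng(0)
x2 = rng.normal(size=200)
x1 = 0.6 * x2 + rng.normal(size=200)
y = 2.0 * x1 - 1.0 * x2 + rng.normal(size=200)

r_simple = np.corrcoef(y, x1)[0, 1]
r_partial = partial_correlation(y, x1, x2)
print(f"simple r(y, x1) = {r_simple:.3f}, partial r(y, x1 | x2) = {r_partial:.3f}")
```

For two regressors the result should agree with the familiar closed-form expression in terms of the three simple correlation coefficients, which offers a quick check on the residual-based procedure.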



5.4.5 Beta Coefficients and Elasticity 

Beta coefficients $\beta^*$ are occasionally used to make statements
about the relative importance of the regressor variables in
a multiple regression model (Pindyck and Rubinfeld 1981).
These coefficients are the parameters of a linear regression
model with each variable normalized by subtracting its mean
and dividing by its standard deviation:

$$\frac{y - \bar{y}}{\sigma_y} = \beta_1^* \, \frac{x_1 - \bar{x}_1}{\sigma_{x_1}} + \beta_2^* \, \frac{x_2 - \bar{x}_2}{\sigma_{x_2}} + \cdots + \varepsilon$$  (5.38)

or

$$y^* = \beta_1^* x_1^* + \beta_2^* x_2^* + \cdots + \varepsilon$$

The $\beta^*$ matrix can be directly deduced from the original slope
parameter "b" of the un-normalized MLR model as:

$$\beta^* = b \, \frac{\sigma_x}{\sigma_y}$$  (5.39)

For example, a beta coefficient $\beta^*$ = 0.7 can be interpreted to
mean that a change of one standard deviation in the regressor variable
leads to a change of 0.7 standard deviation in the dependent variable. For
a two-variable model, $\beta^*$ is the simple correlation between
the two variables. The rescaling associated with the normalized
regression makes it possible to compare the individual
values of $\beta^*$ directly, i.e., the relative importance of the
different regressors can be directly evaluated against each other,
provided the regressors are uncorrelated with each other. A
variable with a high $\beta^*$ coefficient should account for more
of the variance in the response variable (variance is not to be
confused with contribution). The squares of the $\beta^*$ weights are
indicative of the relative effects of the respective variables on
the variation of the response variable.

The beta coefficients indicate or represent the marginal
effect of the standardized regressors on the standardized
response variable. Often, one is interested in deducing the
effect of a fractional (or percentage) change of a regressor $j$
on the dependent variable. This is provided by the elasticity
of $y$ with respect to, say, $x_j$, which is usually evaluated at their
mean values as:

$$E_j = \frac{\partial y}{\partial x_j} \, \frac{\bar{x}_j}{\bar{y}} = b_j \, \frac{\bar{x}_j}{\bar{y}}$$  (5.40)

Elasticities can take both positive and negative values. Large
values of elasticity imply that the response variable is very
responsive to changes in the regressor variable. For non-linear
functions, elasticities can also be calculated at the point
of interest rather than at the mean point. The interpretation
of elasticities is straightforward. If E = 1.5, this implies that a
1% increase in the mean of the regressor variable will result
in a 1.5% increase in y.

Fig. 5.11 Sketch of a flooded-type centrifugal chiller with two
water loops showing the various regressors often used to develop
the performance model for COP

Example 5.4.3: Beta coefficients for ascertaining importance
of driving variables for chiller thermal performance

The thermal performance of a centrifugal chiller is
characterized by the Coefficient of Performance (COP) which is
the dimensionless ratio of the cooling thermal capacity ($Q_{ch}$)
and the compressor electric power ($P_{comp}$) in consistent units.
A commonly used performance model for the COP is one
with three regressors, namely the cooling load $Q_{ch}$, the
condenser inlet temperature $T_{cdi}$ and chiller leaving temperature
$T_{cho}$ (see Fig. 5.11). The condenser and evaporator
temperatures shown are those of the refrigerant as it changes phase.

A data set of 107 performance points from an actual
chiller was obtained whose summary statistics are shown in
Table 5.4. An OLS regression yielded a model with
$R^2$ = 90.1% whose slope coefficients b are also shown in
Table 5.4 along with the beta coefficients and the elasticity
computed from Eqs. 5.39 and 5.40 respectively. One would
conclude looking at the elasticity values that $T_{cdi}$ has the most
influence on COP followed by $Q_{ch}$, while that of $T_{cho}$ is very
small. A 1% increase in $Q_{ch}$ increases COP by 0.431% while
a 1% increase in $T_{cdi}$ would decrease COP by 0.603%. The
beta coefficients, on the other hand, take into account the
range of variation of the variables. For example, the load
variable $Q_{ch}$ can change from 20 to 100% while $T_{cdi}$ usually
changes only by 15°C or so. Thus, the beta coefficients express
a change in COP of 0.839 standard deviations for a one standard
deviation change in $Q_{ch}$ (i.e., a load change of 88.1 kW) while a
comparable one standard deviation change in $T_{cdi}$ (of 4.28°C)
would result in a decrease of 0.496 standard deviations in COP. ■

Table 5.4 Associated statistics of the four variables, results of the OLS
regression and beta coefficients

                          Response   Regressors
                          COP        Q_ch (kW)   T_cdi (°C)   T_cho (°C)
Mean                      3.66       205.8       23.66        7.37
St. dev                   0.806      88.09       4.283        2.298
Min                       2.37       86          16.01        3.98
Max                       4.98       361.4       29.95        10.94
Slope coeff. b                       0.0077      -0.0933      0.0354
Beta coeff. (Eq. 5.39)               0.839       -0.496       0.101
Elasticity (Eq. 5.40)                0.431       -0.603       0.071
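As a quick check of the last two rows of Table 5.4, Eqs. 5.39 and 5.40 can be evaluated directly from the tabulated means, standard deviations and slope coefficients. This is only a sketch; the variable names (including the labels Q_ch, T_cdi and T_cho) are illustrative, and small differences from the tabulated values are to be expected because the slope coefficients are rounded.

```python
import numpy as np

# Summary statistics from Table 5.4, regressor order: Q_ch, T_cdi, T_cho
names = ["Q_ch", "T_cdi", "T_cho"]
mean_x = np.array([205.8, 23.66, 7.37])
std_x = np.array([88.09, 4.283, 2.298])
slope_b = np.array([0.0077, -0.0933, 0.0354])

# Response variable (COP)
mean_cop, std_cop = 3.66, 0.806

beta = slope_b * std_x / std_cop            # Eq. 5.39
elasticity = slope_b * mean_x / mean_cop    # Eq. 5.40, evaluated at the means

for name, b_star, e in zip(names, beta, elasticity):
    print(f"{name:6s} beta = {b_star:7.3f}   elasticity = {e:7.3f}")
```

The ranking differs between the two measures: the beta coefficients put the load first, while the elasticities put the condenser inlet temperature first, as discussed in the example.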
5.5 Assumptions and Sources of Error During
OLS Parameter Estimation

5.5.1 Assumptions 

The ordinary least squares (OLS) regression method:
(i) enables simple or multiple linear regression models to
be identified from data, which can then be used for future
prediction of the response variable along with its
uncertainty bands, and
(ii) allows statistical statements to be made about the
estimated model parameters.
No statistical assumptions are used to obtain the OLS
estimators for the model coefficients. When nothing is known
regarding measurement errors, OLS is often the best choice
for estimating the parameters. However, in order to make
statistical statements about these estimators and the model
predictions, it is necessary to acquire information regarding the
measurement errors. Ideally, one would like the error terms
$\varepsilon$ to be normally distributed, without serial correlation, with
mean zero and constant variance. The implications of each of
these four assumptions, as well as a few additional ones, will
be briefly addressed below since some of these violations
may lead to biased coefficient estimates as well as distorted
estimates of the standard errors, confidence intervals, and
statistical tests.

(a) Errors should have zero mean: If this is not true, the
OLS estimator of the intercept will be biased. The impact
of this assumption not being correct is generally
viewed as the least critical among the various assumptions.
Mathematically, this implies that the expected value
$E(\varepsilon) = 0$.

(b) Errors should be normally distributed: If this is not true, 
statistical tests and confidence intervals are incorrect for 
small samples though the OLS coefficient estimates are 
unbiased. Figure 5.3 which illustrates this behavior has 
already been discussed. This problem can be avoided 
by having large samples, and verifying that the model is 
properly specified. 

(c) Errors should have constant variance: Violation
of this basic OLS assumption results in increased
standard errors of the estimates and wider model
prediction confidence intervals (though the OLS
estimates themselves are unbiased). In this sense, there is
a loss in statistical power. This condition is expressed
mathematically as var($\varepsilon_i$) = $\sigma^2$. This issue is discussed
further in Sect. 5.6.3.

(d) Errors should not be serially correlated: This violation
is equivalent to having fewer independent data, and also
results in a loss in statistical power with the same
consequences as (c) above. Serial correlations may occur
due to the manner in which the experiment is carried
out. Extraneous factors, i.e., factors beyond our control
(such as the weather, for example) may leave little or
no choice as to how the experiments are executed. An
example of a reversible experiment is the classic pipe-friction
experiment where the flow through a pipe is
varied so as to cover both laminar and turbulent flows,
and the associated friction drops are observed. Gradually
increasing the flow one way (or decreasing it the
other way) may introduce biases in the data which will
subsequently also bias the model parameter estimates.
In other circumstances, certain experiments are irreversible.
For example, the loading on a steel sample to produce
a stress-strain plot has to be performed by gradually
increasing the loading till the sample breaks; one
cannot proceed in the other direction. Usually the biases
brought about by the test sequence are small, and this
may not be crucial. In mathematical terms, this condition,
for a first order case, can be written as $E(\varepsilon_i \cdot \varepsilon_{i+1}) = 0$.
This assumption, which is said to be hardest to verify, is
further discussed in Sect. 5.6.4.

(e) Errors should be uncorrelated with the regressors:
This violation results in OLS coefficient estimates
being biased and the predicted OLS
confidence intervals understated, i.e., narrower. This
violation is a very important one, and is often due to
"mis-specification error" or underfitting. Omission of
influential regressor variables and improper model
formulation (assuming a linear relationship when it is not)
are likely causes. This issue is discussed at more length
in Sect. 10.4.1.

(f) Regressors should not have any measurement error:
Violation of this assumption in some (or all) regressors
will result in biased OLS coefficient estimates for those
(or all) regressors. The model can be used for prediction
but the confidence limits will be understated. Strictly
speaking, this assumption is hardly ever satisfied since
there is always some measurement error. However,
in most engineering studies, measurement errors in the
regressors are not large compared to the random errors
in the response, and so this violation may not have
important consequences. As a rough rule of thumb, this
violation becomes important when the errors in x reach
about a fifth of the random errors in y, and when
multicollinearity is present. If the errors in x are
known, there are procedures which allow unbiased
coefficient estimates to be determined (see Sect. 10.4.2).
Mathematically, this condition is expressed as
var($x_i$) = 0.

(g) Regressor variables should be independent of each other:
This violation applies to models identified by multiple
regression when the regressor variables are correlated
among each other (called multicollinearity). This
is true even if the model provides an excellent fit to the
data. Estimated regression coefficients, though unbiased,
will tend to be unstable (their values tend to change
greatly when a data point is dropped or added), and the
OLS standard errors and the prediction intervals will
be understated. Multicollinearity is likely to be a problem
only when one (or more) of the correlation coefficients
among the regressors exceeds 0.85 or so. Sect. 10.3
deals with this issue at more length.



5.5.2 Sources of Errors During Regression

Perhaps the most crucial issue during parameter identification
is the type of measurement inaccuracy present. This has a
direct influence on the estimation method to be used. Though
statistical theory has more or less neatly classified this behavior
into a finite number of groups, the data analyst is often
stymied by data which does not fit into any one category.
Remedial action advocated does not seem to entirely remove
the adverse data conditioning. A certain amount of experience
is required to surmount this type of adversity, which, further,
is circumstance-specific. As discussed earlier, there can
be two types of errors:

(a) measurement error. The following sub-cases can be
identified depending on whether the error occurs:
(i) in the dependent variable, in which case the model
form is:

$$y_i + \delta_i = \beta_0 + \beta_1 x_i$$  (5.41a)

(ii) in the regressor variable, in which case the model
form is:

$$y_i = \beta_0 + \beta_1 (x_i + \gamma_i)$$  (5.41b)

(iii) in both dependent and regressor variables:

$$y_i + \delta_i = \beta_0 + \beta_1 (x_i + \gamma_i)$$  (5.41c)

Further, the errors $\delta$ and $\gamma$ (which will be jointly
represented by $\varepsilon$) can be additive, in which case
$\varepsilon_i \neq f(y_i, x_i)$, or multiplicative, in which case
$\varepsilon_i = f(y_i, x_i)$, or worse still, a combination of both.
Section 10.4.1 discusses this issue further.

(b) model misspecification error. How this would affect
the model residuals $\varepsilon_i$ is difficult to predict, and is
extremely circumstance-specific. Misspecification could
be due to several factors, for example, one or more
important variables have been left out of the model, or the
functional form of the model is incorrect. Even if the
physics of the phenomenon or of the system is well
understood and can be cast in mathematical terms,
identifiability constraints may require that a simplified or
macroscopic model be used for parameter identification
rather than the detailed model (see Sect. 10.2). This is
likely to introduce both bias and random noise in the
parameter estimation process except when model $R^2$ is
very high ($R^2$ > 0.9). This issue is further discussed in
Sect. 5.6. Formal statistical procedures do not explicitly
treat this case but limit themselves to type (a) errors and
more specifically to case (i) assuming purely additive or
multiplicative errors. The implicit assumptions in OLS
and their implications, if violated, are those described
in Sect. 5.5.1.



5.6 Model Residual Analysis¹

5.6.1 Detection of Ill-Conditioned Model
Residual Behavior

The availability of statistical software has resulted in routine
and easy application of OLS to multiple linear models.
However, there are several underlying assumptions that affect
the individual parameter estimates of the model as well as
the overall model itself. Once a model has been identified,
the general tendency of the analyst is to hasten and use the
model for whatever purpose intended. However, it is
extremely important (and this phase is often overlooked) that an
assessment of the model be done to determine whether the
OLS assumptions are met; otherwise the model is likely to
be deficient or misspecified, and yield misleading results. In
the last few decades, there has been much progress made on
how to screen model residual behavior so as to provide
diagnostic insight into model deficiency or misspecification.

A few idealized plots illustrate some basic patterns of
improper residual behavior which are addressed in more detail
in the later sections of this chapter. Figure 5.12 illustrates the
effect of omitting an important dependence, which suggests
that an additional variable is to be introduced in the model
which distinguishes between the two groups. The presence
of outliers and the need for more robust regression schemes
which are immune to such outliers is illustrated in Fig. 5.13.
The presence of non-constant variance (or heteroscedasticity)
in the residuals is a very common violation and one of
several possible manifestations is shown in Fig. 5.14. This
particular residual behavior is likely to be remedied by using
a log transform of the response variable instead of the variable
itself. Another approach is to use weighted least squares
estimation procedures described later in this chapter. Though
non-constant variance is easy to detect visually, its cause is
difficult to identify. Figure 5.15 illustrates a typical behavior
which arises when a linear function is used to model a
quadratic variation. The proper corrective action will increase
the predictive accuracy of the model (RMSE will be lower),
result in the estimated parameters being more efficient (i.e.,
lower standard errors), and most importantly, allow more
sound and realistic interpretation of the model prediction
uncertainty bounds.

Fig. 5.12 The residuals can be separated into two distinct groups
(shown as crosses and dots) which suggest that the response variable
is related to another regressor not considered in the regression model.
This residual pattern can be overcome by reformulating the model by
including this additional variable. One example of such a time-based
event system change is shown in Fig. 9.15 of Chap. 9.

Fig. 5.13 Outliers indicated by crosses suggest that data should be
checked and/or robust regression used instead of OLS

Fig. 5.14 Residuals with bow shape and increased variability (i.e.,
error increases as the response variable y increases) indicate that a log
transformation of y is required

Fig. 5.15 Bow-shaped residuals suggest that a non-linear model, i.e. a
model with a square term in the regressor variable, should be evaluated

Fig. 5.16 Serial correlation is indicated by a pattern in the residuals
when plotted in the sequence the data was collected, i.e., when plotted
against time even though time may not be a regressor in the model

¹ Herschel: "... almost all of the greatest discoveries in astronomy have
resulted from the consideration of what . . . (was) termed residual
phenomena".

Figure 5.16 illustrates the occurrence of serial correlations
in time series data which arises when the error terms are
not independent. Such patterned residuals occur commonly
during model development and provide useful insights into
model deficiency. Serial correlation (or autocorrelation) has
special pertinence to time series data (or data ordered in
time) collected from in-situ performance of mechanical and
thermal systems and equipment. Autocorrelation is present
if adjacent model residuals show a trend or
a pattern of clusters above or below the zero value that can
be discerned visually. Such correlations can either suggest
that additional variables have been left out of the model
(model-misspecification error), or could be due to the nature of
the process itself (called pure or "pseudo" autocorrelation).
The latter is due to the fact that equipment loading over a
day would follow an overall cyclic curve (as against random
jumps from say full load to half load) consistent with the
diurnal cycle and the way the system is operated. In such
cases, positive residuals would tend to be followed by positive
residuals, and vice versa. Time series data and models are
treated further in Chap. 9.

Problems associated with model underfitting and overfitting
are usually the result of a failure to identify the non-random
pattern in time series data. Underfitting does not capture
enough of the variation in the response variable which
the corresponding set of regressor variables can possibly
explain. For example, all four models fit to their respective
sets of data as shown in Fig. 5.17 have identical $R^2$ values
and t-statistics but are distinctly different in how they capture
the data variation. Only plot (a) can be described by a linear
model. The data in (b) needs to be fitted by a higher order
model, while one data point in (c) and (d) distorts the entire
model. Blind model fitting (i.e., relying only on model
statistics) is, thus, inadvisable.

Fig. 5.17 Plot of the data (x, y) with the fitted lines for four data
sets. The models have identical $R^2$ and t-statistics but only the
first model is a realistic model. (From Chatterjee and Price 1991
by permission of John Wiley and Sons)

Overfitting implies capturing randomness in the model,
i.e., attempting to fit the noise in the data. A rather extreme
example is when one attempts to fit a model with six parameters
to six data points which have some inherent experimental
error. The model has zero degrees of freedom and the set
of six equations can be solved without error (i.e., RMSE = 0).
This is clearly unphysical because the model parameters
have also "explained" the random noise in the observations
in a deterministic manner.
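The six-parameter, six-point case can be illustrated numerically. The sketch below assumes a hypothetical linear process with added normal noise and uses a fifth order polynomial as the six-parameter model; neither choice comes from the text, and both are made only so that the zero-residual outcome can be seen.

```python
import numpy as np

# Hypothetical data: a linear process observed with random noise at six points
rng = np.random.default_rng(1)
x = np.arange(1.0, 7.0)
y = 10.0 + 1.5 * x + rng.normal(0.0, 1.0, size=6)

# Six-parameter model (fifth order polynomial): zero degrees of freedom,
# so the six equations are solved exactly
X_full = np.vander(x, 6)
coef_full = np.linalg.solve(X_full, y)
sse_full = float(np.sum((y - X_full @ coef_full) ** 2))

# Two-parameter model (straight line): four degrees of freedom remain
X_line = np.vander(x, 2)
coef_line, *_ = np.linalg.lstsq(X_line, y, rcond=None)
sse_line = float(np.sum((y - X_line @ coef_line) ** 2))

print(f"SSE, six-parameter model: {sse_full:.2e}")
print(f"SSE, straight line      : {sse_line:.2e}")
```

The six-parameter fit reproduces the observations to within round-off, noise included, whereas the straight line leaves residuals of the order of the noise, which is the physically meaningful result.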

Both underfitting and overfitting can be detected by performing
certain statistical tests on the residuals. The most
commonly used test for white noise (i.e., uncorrelated
residuals) involving model residuals is the Durbin-Watson
(DW) statistic defined by:

$$\mathrm{DW} = \frac{\sum_{i=2}^{n} (\varepsilon_i - \varepsilon_{i-1})^2}{\sum_{i=1}^{n} \varepsilon_i^2}$$  (5.42)

where $\varepsilon_i$ is the residual at time interval i, defined as
$\varepsilon_i = y_i - \hat{y}_i$.

If there is no serial or autocorrelation present, the expected
value of DW is 2. If the model underfits, DW would be
less than 2 while it would be greater than 2 for an overfitted
model, the limiting range being 0-4. Tables are available for
approximate significance tests with different numbers of
regressor variables and number of data points. Table A.13
assembles lower and upper critical values of DW statistics to
test autocorrelation. For example, if n = 20, and the model
has three variables (p = 3), the lower and upper critical values
at the 0.05 significance level are 1.00 and 1.68: the null
hypothesis that the correlation coefficient is equal to zero can
be rejected if DW is below 1.00, cannot be rejected if DW is
above 1.68, and the test is inconclusive in between.
Note that the critical values in the table are one-sided,
i.e., apply to one tailed distributions.

It is important to note that the DW statistic is only sensitive
to correlated errors in adjacent observations, i.e., when
only first-order autocorrelation is present. For example, if
the time series has seasonal patterns, then higher autocorrelations
may be present which the DW statistic will be unable
to detect. More advanced concepts and modeling are discussed
in Sect. 9.5 while treating stochastic time series data.
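Equation 5.42 is simple to evaluate once the residuals are available in the order in which the data were collected. The following sketch uses two made-up residual sequences (not taken from any example in this chapter) to show the two extremes: clustered residuals give a DW value well below 2, and alternating residuals give a value well above 2.

```python
import numpy as np


def durbin_watson(residuals):
    """Durbin-Watson statistic (Eq. 5.42) for residuals in time order."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))


# Made-up residual sequences for illustration only
dw_clustered = durbin_watson([1, 1, 1, -1, -1, -1])    # runs of the same sign
dw_alternating = durbin_watson([1, -1, 1, -1, 1, -1])  # sign flips at every step

print(f"clustered residuals  : DW = {dw_clustered:.3f}")
print(f"alternating residuals: DW = {dw_alternating:.3f}")
```

In practice the computed value would then be compared with the tabulated lower and upper critical values for the given n and p.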



5.6.2 Leverage and Influence Data Points 

Most of the aspects discussed above relate to identifying
general patterns in the residuals of the entire data set. Another
issue is the ability to identify subsets of data that have
an unusual or disproportionate influence on the estimated
model in terms of parameter estimation. Being able to flag
such influential subsets of individual points allows one to
investigate their validity, or to glean insights for better
experimental design since they may contain the most interesting
system behavioral information. Note that such points are not
necessarily "bad" data points which should be omitted, but
should be viewed as being "distinctive" observations in the
overall data set. Scatter plots reveal such outliers easily for
single regressor situations, but are inappropriate for
multivariate cases. Hence, several statistical measures have been
proposed to deal with multivariate situations, the influence
and leverage indices being widely used (Belsley et al. 1980;
Cook and Weisberg 1982).
The leverage of a point quantifies the extent to which that
point is "isolated" in the x-space, i.e., its distinctiveness in
terms of the regressor variables. It has a large impact on the
numerical values of the model parameters being estimated.
Consider the following matrix (called the hat matrix):

$$H = X (X'X)^{-1} X'$$  (5.43)

If one has a data set with two regressors, the matrix $(X'X)$
would be of order (3 x 3), i.e., equal to the number of parameters
p in the model (constant plus the two regressor coefficients),
while $H$ itself is of order (n x n). The diagonal element
$p_{ii}$ of $H$ can be related to the distance between $x_i$
and $\bar{x}$, and is defined as the leverage of the i-th data point.
The diagonal elements have values between 0 and 1, and their
average value is equal to (p/n) where n is the number of
observation sets. Points with $p_{ii}$ > 3(p/n) are regarded as points
with high leverage (sometimes the threshold is taken as 2(p/n)).

Large residuals are traditionally used to highlight suspect
data points or data points unduly affecting the regression
model. Instead of looking at residuals $\varepsilon_i$, it is more meaningful
to study a normalized or scaled value, namely the standardized
residuals or R-student residuals, where

$$\text{R-student} = \frac{\varepsilon_i}{\mathrm{RMSE} \cdot [1 - p_{ii}]^{1/2}}$$  (5.44)

Points with |R-student| > 3 can be said to be influence points
which corresponds to a significance level of 0.01. Sometimes
a less conservative value of 2 is used corresponding to the
0.05 significance level, with the underlying assumption that
residuals or errors are Gaussian.

A data point is said to be influential if its deletion, singly
or in combination with a relatively few others, causes
statistically significant changes in the fitted model coefficients.
See Sect. 3.5.3 for a discussion based on graphical
considerations of this concept. There are several measures used to
describe influence; a common one is DFITS:

$$\mathrm{DFITS}_i = \frac{\varepsilon_i \, (p_{ii})^{1/2}}{s_i \, (1 - p_{ii})^{1/2}}$$  (5.45)

where $\varepsilon_i$ is the residual error of observation i, and $s_i$ is the
standard deviation of the residuals without considering the i-th
residual. Points with $|\mathrm{DFITS}_i| > 2[p/(n - p)]^{1/2}$ are
flagged as influential points.

Both the R-student statistic and the DFITS indices are often
used to detect influence points. In summary, just because
a point has high leverage does not make it influential. It is
advisable to identify points with high leverage, and, then,
examine them to determine whether they are influential as well.

Influential observations can impact the final regression
model in different ways (Hair et al. 1998). For example, in
Fig. 5.18a, the model residuals are not significant and the two
influential observations shown as filled dots reinforce the
general pattern in the model and lower the standard error of the
parameters and of the model prediction. Thus, the two points
would be considered to be leverage points which are beneficial
to our model building. Influential points which adversely
impact model building are illustrated in Fig. 5.18b and c. In
the former, the two influential points almost totally account
for the observed relationship but would not have been identified
as outlier points. In Fig. 5.18c, the two influential points
have totally altered the model identified, and the actual data
points would have shown up as points with large residuals
which the analyst would probably have identified as spurious.
The next frame (d) illustrates the instance when an influential
point changes the intercept of the model but leaves the slope
unaltered. The two final frames, Fig. 5.18e and f, illustrate
two cases, hard to identify and rectify, in which two influential
points reinforce each other in altering both the slope and the
intercept of the model though their relative positions are very
much different. Note that data points that satisfy both these
statistical criteria, i.e., are both influential and have high
leverage, are the ones worthy of closer scrutiny. Most statistical
programs have the ability to flag such points, and hence
performing this analysis is fairly straightforward.

Fig. 5.18a-f Common patterns of influential observations. Legend:
regression slope without influentials, regression slope with
influentials, • influential observation, o typical observation. (From
Hair et al. 1998 by © permission of Pearson Education)

Thus, in conclusion, individual data points can be outliers,
leverage or influential points. Outliers are relatively simple to
detect and to interpret using the R-student statistic. Leverage
of a point is a measure of how unusually the point lies in the
x-space. An influence point is one whose removal from the
data set would have an important effect on the regression
model. Influential points are the ones
which need particular attention since they provide insights
about the robustness of the fit. In any case, all three measures
(leverage $p_{ii}$, DFITS and R-student) provide indications as to
the role played by different observations towards the overall
model fit. Ultimately, the decision of whether to
retain or reject such points is somewhat based on judgment.

Example 5.6.1: Example highlighting the different characteristics
of outliers or residuals versus influence points.
Consider the following made-up data (Table 5.5) where x ranges from
1 to 10, and the model is $y = 10 + 1.5x$ to which random normal
noise $\varepsilon = [0, \sigma = 1]$ has been added to give y
(second column). The last observation has been intentionally
corrupted to a value of 50 as shown (third column).

Table 5.5 Data table for Example 5.6.1

x     y[0,1]      y1
1     11.69977    11.69977
2     12.72232    12.72232
3     16.24426    16.24426
4     19.27647    19.27647
5     21.19835    21.19835
6     23.73313    23.73313
7     21.81641    21.81641
8     25.76582    25.76582
9     29.09502    29.09502
10    28.9133     50

How well a linear model fits the data is depicted in Fig. 5.19. The
table of unusual residuals shown below lists all observations which
have Studentized residuals greater than 2.0 in absolute value.

Row   x      y      Predicted y   Residual   Studentized residual
10    10.0   50.0   37.2572       12.743     11.43

Note that observation 10 is flagged as an unusual residual (not
surprising since this was intentionally corrupted), yet no
observation has been identified as influential even though
observation 10 is very much an outlier (the studentized value is
very large; recall that a value of 3.0 would indicate a 99% CL).
Thus, the error in one point seems to be overwhelmed by the
well-behaved nature of the other nine points. This example serves to
highlight the different characteristics of outliers versus influence
points. ■
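The numbers in the table of unusual residuals can be reproduced with a few lines of linear algebra. The following is a minimal sketch, assuming NumPy is available; the variable names are illustrative and are not taken from any particular statistical package. It fits the OLS line to the corrupted column of Table 5.5 and computes the leverage and the externally studentized (R-student) residual of each point.

```python
import numpy as np

# Table 5.5 data: x = 1..10, last y value intentionally corrupted to 50
x = np.arange(1, 11, dtype=float)
y = np.array([11.69977, 12.72232, 16.24426, 19.27647, 21.19835,
              23.73313, 21.81641, 25.76582, 29.09502, 50.0])

X = np.column_stack([np.ones_like(x), x])       # design matrix [1, x]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS intercept and slope
y_hat = X @ beta
resid = y - y_hat

n, p = X.shape
H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
lev = np.diag(H)                                # leverage of each observation

# Residual variance with point i deleted, then R-student residuals
sse = resid @ resid
s2_deleted = (sse - resid**2 / (1.0 - lev)) / (n - p - 1)
r_student = resid / np.sqrt(s2_deleted * (1.0 - lev))

print("row  predicted  residual  leverage  R-student")
for i in range(n):
    flag = "  <-- |R-student| > 2" if abs(r_student[i]) > 2.0 else ""
    print(f"{i + 1:3d}  {y_hat[i]:9.4f}  {resid[i]:8.3f}  "
          f"{lev[i]:8.4f}  {r_student[i]:9.2f}{flag}")
```

Only the last row should be flagged, with a predicted value, residual and R-student value close to those listed in the table above.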



5.6.3 Remedies for Non-uniform Model Residuals

Non-uniform model residuals, or heteroscedasticity, can be due to:
(i) the nature of the process investigated, (ii) noise in the data,
or (iii) the method of data collection from samples which are known
to have different variances. Three possible generic remedies for
non-constant variance are to (Chatterjee and Price 1991):

(a) introduce additional variables into the model and collect new
data: The physics of the problem along with model residual behavior
can shed light on whether certain key variables, left out in the
original fit, need to be introduced or not. This aspect is further
discussed in Sect. 5.6.5;

(b) transform the dependent variable: This is appropriate when the
errors in measuring the dependent variable may follow a probability
distribution whose variance is a function of the mean of the
distribution. In such cases, the model residuals are likely to
exhibit heteroscedasticity which can be removed by using
exponential, Poisson or Binomial transformations. For example, a
variable which is distributed Binomially with parameters n and p has
mean $np$ and variance $np(1-p)$ (Sect. 2.4.2). For a Poisson
variable, the mean and variance are equal. The transformations shown
in Table 5.6 will stabilize variance, and the distribution of the
transformed variable will be closer to the normal distribution.

Table 5.6 Transformations in dependent variable y likely to
stabilize non-uniform model variance

Distribution   Variance of y in terms of its mean μ   Transformation
Poisson        μ                                       $y^{1/2}$
Binomial       μ(1 - μ)/n                              $\sin^{-1}(y^{1/2})$

The logarithmic transformation is also widely used in certain cases
to transform a non-linear model into a linear one (see Sect. 9.5.1).
When the variables have a large standard deviation compared to the
mean, working with the data on a log scale often has the effect of
dampening variability and reducing asymmetry. This is often an
effective means of removing heteroscedasticity as well. However,
this approach is valid only when the magnitude of the residuals
increases (or decreases) with that of one of the variables.

Fig. 5.19 a Observed vs predicted plot, b Residual plot versus regressor

Example 5.6.2: Example of variable transformation to remedy
improper residual behavior

The following example serves to illustrate the use of variable
transformation. Table 5.7 shows data from 27 departments in a
university with y as the number of faculty and staff and x the
number of students.

A simple linear regression yields the model
$y = 14.4481 + 0.105361x$ with R-squared = 77.6% and
RMSE = 21.7293. However, the residuals are unacceptable, showing a
strong funnel shape (see Fig. 5.20a).

Fig. 5.20 a Residual plot of linear model, b Residual plot of log
transformed linear model, c Residual plot of log transformed
quadratic model



Table 5.7 Data table for Example 5.6.2

No.   x      y      No.   x       y
1     294    30     15    615     100
2     247    32     16    999     109
3     267    37     17    1,022   114
4     358    44     18    1,015   117
5     423    47     19    700     106
6     311    49     20    850     128
7     450    56     21    980     130
8     534    62     22    1,025   160
9     438    68     23    1,021   97
10    697    78     24    1,200   180
11    688    80     25    1,250   112
12    630    84     26    1,500   210
13    709    88     27    1,650   135
14    627    97





Instead of a linear model in y, a linear model in ln(y) is
investigated. In this case, the model R-squared = 76.1% and
RMSE = 0.252396. However, these statistics should NOT be compared
directly with those of the previous model since the response
variable is no longer the same (in one case, it is "y"; in the
other, "ln y").

Let us not look into this aspect, but rather study the residual
behavior. Notice that the log-transformed linear model does reduce
some of the improper residual variance, but the inverted-U shape of
the residuals is indicative of model mis-specification (see
Fig. 5.20b).

Finally, using a quadratic model along with the ln transformation
results in the model:

$\ln(y) = 2.8516 + 0.00311267\,x - 0.00000110226\,x^2$

The residuals shown in Fig. 5.20c are now quite well behaved as a
result of such a transformation. ■
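The three fits of this example can be repeated directly from Table 5.7. The sketch below assumes NumPy and uses `np.polyfit` for all three models; the helper name `fit_poly` is hypothetical and introduced only for illustration. Plotting each set of residuals against x would show the funnel shape, the inverted-U shape and the well-behaved pattern in turn.

```python
import numpy as np

# Table 5.7: x = number of students, y = number of faculty and staff
x = np.array([294, 247, 267, 358, 423, 311, 450, 534, 438, 697, 688, 630,
              709, 627, 615, 999, 1022, 1015, 700, 850, 980, 1025, 1021,
              1200, 1250, 1500, 1650], dtype=float)
y = np.array([30, 32, 37, 44, 47, 49, 56, 62, 68, 78, 80, 84, 88, 97,
              100, 109, 114, 117, 106, 128, 130, 160, 97, 180, 112, 210,
              135], dtype=float)

def fit_poly(x, t, degree):
    """OLS polynomial fit of response t on x; returns coefficients
    (highest power first), residuals and R-squared."""
    coef = np.polyfit(x, t, degree)
    resid = t - np.polyval(coef, x)
    r2 = 1.0 - (resid @ resid) / np.sum((t - t.mean()) ** 2)
    return coef, resid, r2

lin_coef, lin_resid, lin_r2 = fit_poly(x, y, 1)             # y on x
log_coef, log_resid, log_r2 = fit_poly(x, np.log(y), 1)     # ln(y) on x
quad_coef, quad_resid, quad_r2 = fit_poly(x, np.log(y), 2)  # ln(y) on x, x^2

print("linear    :", lin_coef[::-1], "R2 =", round(lin_r2, 3))
print("log-linear:", log_coef[::-1], "R2 =", round(log_r2, 3))
print("log-quad  :", quad_coef[::-1], "R2 =", round(quad_r2, 3))
```

As the text cautions, the R-squared values of the models in y and in ln(y) are printed only for reference and should not be compared with each other.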
(c) perform weighted least squares: This approach is more flexible
and several variants exist (Chatterjee and Price 1991). As described
earlier, OLS model residual behavior can exhibit non-uniform
variance (called heteroscedasticity) even if the model is
structurally complete, i.e., the model is not mis-specified. This
violates one of the standard OLS assumptions. In a multiple
regression model, detection of heteroscedasticity may not be very
straightforward since only one or two variables may be the culprits.
Examination of the residuals versus each variable in turn, along
with intuition and understanding of the physical phenomenon being
modeled, can be of great help. If heteroscedasticity is left
untreated, the OLS estimates will lack precision and the estimated
standard errors of the model parameters will be larger. If this
phenomenon occurs, the model identification should be redone with
explicit recognition of this fact.

During OLS, the sum of the squared model residuals of all points is
minimized with no regard to the values of the individual points or
to points from different domains of the range of variability of the
regressors. The basic concept of weighted least squares (WLS) is to
simply assign different weights to different points according to a
certain scheme. Thus, the general formulation of WLS is that the
following function should be minimized:

WLS function $= \sum_i w_i \,(y_i - \beta_0 - \beta_1 x_{1i} - \cdots - \beta_p x_{pi})^2$   (5.46)

where $w_i$ are the weights of individual points. These are
formulated differently depending on the weighting scheme selected
which, in turn, depends on prior knowledge about the process
generating the data.

(c-i) Errors Are Proportional to x Resulting in Funnel-Shaped
Residuals Consider the simple model $y = \alpha + \beta x + \varepsilon$
whose residuals $\varepsilon$ have a standard deviation which
increases with the regressor variable (resulting in the funnel-like
shape in Fig. 5.21). Assuming a weighting scheme such as
$\mathrm{var}(\varepsilon_i) = k^2 x_i^2$ transforms the model into:

$\frac{y}{x} = \frac{\alpha}{x} + \beta + \frac{\varepsilon}{x}$   or   $y' = \alpha x' + \beta + \varepsilon'$   (5.47)

Note that the variance of $\varepsilon'$ is constant and equals
$k^2$. If the assumption about the weighting scheme is correct, the
transformed model will be homoscedastic, and the model parameters
$\alpha$ and $\beta$ will be efficiently estimated by OLS (i.e., the
standard errors of the estimates will be optimal).

The above transformation is only valid when the model residuals
behave as shown in Fig. 5.21. If residuals behave differently, then
different transformations or weighting schemes will have to be
explored. Whether a particular transformation is adequate or not can
only be gauged by the behavior of the variance of the residuals.
Note that the analyst has to perform two separate regressions: one
an OLS regression in order to determine the residual amounts of the
individual data points, and then a WLS regression for final
parameter identification. This is often referred to as two-stage
estimation.

Fig. 5.21 Type of heteroscedastic model residual behavior which
arises when errors are proportional to the magnitude of the x
variable
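The equivalence between the transformed model of Eq. 5.47 and a WLS fit of Eq. 5.46 with weights $w_i = 1/x_i^2$ can be checked numerically. The sketch below uses synthetic data; the true parameter values, the noise factor `k` and the random seed are assumptions made purely for illustration, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative values): error std. dev. proportional to x
alpha_true, beta_true, k = 2.0, 1.5, 0.3
x = np.linspace(1.0, 10.0, 50)
y = alpha_true + beta_true * x + rng.normal(0.0, k * x)

# Route 1: OLS on the transformed model  y/x = alpha*(1/x) + beta + eps'
A_t = np.column_stack([1.0 / x, np.ones_like(x)])
coef_t, *_ = np.linalg.lstsq(A_t, y / x, rcond=None)
alpha_t, beta_t = coef_t

# Route 2: WLS on the original model with weights w = 1/x^2,
# implemented by scaling each row by sqrt(w)
w = 1.0 / x**2
A = np.column_stack([np.ones_like(x), x])
sw = np.sqrt(w)
coef_w, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
alpha_w, beta_w = coef_w

print("transformed OLS:", alpha_t, beta_t)
print("weighted LS    :", alpha_w, beta_w)
```

Both routes minimize the same weighted sum of squares, so the two sets of estimates should agree to within floating-point round-off.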

(c-ii) Replicated Measurements with Different Variance It could
happen, especially with models involving one regressor variable only
and when the data is obtained in the framework of a designed
experimental study (as against observational or non-experimental
data), that one obtains replicated measurements on the response
variable corresponding to a set of fixed values of the explanatory
variables. For example, consider the case when the regressor
variable x takes several discrete values. If the physics of the
phenomenon cannot provide any theoretical basis on how to select a
particular weighting scheme, then this has to be determined
experimentally from studying the data. If there is an increasing
pattern in the heteroscedasticity present in the data, this could be
modeled either by a logarithmic transform (as illustrated in
Example 5.6.2) or a suitable variable transformation. Here, another
more versatile approach which can be applied to any pattern of the
residuals is illustrated. Each observed residual $\varepsilon_{ij}$
(where the index for discrete x values is i, and the observations at
each discrete x value are indexed $j = 1, 2, \ldots, n_i$) is made
up of two parts, i.e.,
$\varepsilon_{ij} = (y_{ij} - \bar{y}_i) + (\bar{y}_i - \hat{y}_{ij})$.
The first part is referred to as pure error while the second part
measures lack of fit. An assessment of heteroscedasticity is based
on pure error. Thus, the WLS weight may be estimated as
$w_i = 1/s_i^2$ where the mean square error is:

$s_i^2 = \frac{\sum_j (y_{ij} - \bar{y}_i)^2}{(n_i - 1)}$   (5.48)

Alternatively, a model can be fit to the mean values of x and the
$s_i^2$ values in order to smoothen out the weighting function, and
this function used instead. Thus, this approach would also qualify
as a two-stage estimation process. The following example illustrates
this approach.


Table 5.8 Measured data, OLS residuals deduced from Eq. 5.49a and
the weights calculated from Eq. 5.49b

x       y       Residual ε_i   w_i        x       y       Residual ε_i   w_i
1.15    0.99     0.26329       0.9882     9.03    9.47    -0.20366       0.4694
1.90    0.98    -0.59826       1.7083     9.07    11.45    1.730922      0.4614
3.00    2.60    -0.2272        6.1489     9.11    12.14    2.375506      0.4535
3.00    2.67    -0.1572        6.1489     9.14    11.50    1.701444      0.4477
3.00    2.66    -0.1672        6.1489     9.16    10.65    0.828736      0.4440
3.00    2.78    -0.0472        6.1489     9.37    10.64    0.580302      0.4070
3.00    2.80    -0.0272        6.1489     10.17   9.78    -1.18802       0.3015
5.34    5.92     0.435964      15.2439    10.18   12.39    1.410628      0.3004
5.38    5.35    -0.17945       13.6185    10.22   11.03    0.005212      0.2963
5.40    4.33    -1.22216       12.9092    10.22   8.00    -3.02479       0.2963
5.40    4.89    -0.66216       12.9092    10.22   11.90    0.875212      0.2963
5.45    5.21    -0.39893       11.3767    10.18   8.68    -2.29937       0.3004
7.70    7.68    -0.48358       0.9318     10.50   7.25    -4.0927        0.2696
7.80    9.81     1.53288       0.8768     10.23   13.46    2.423858      0.2953
7.81    6.52    -1.76847       0.8716     10.03   10.19   -0.61906       0.3167
7.85    9.71     1.37611       0.8512     10.23   9.93    -1.10614       0.2953
7.87    9.82     1.463402      0.8413
7.91    9.81     1.407986      0.8219
7.94    8.50     0.063924      0.8078





Example 5.6.3:* Example of weighted regression for replicate
measurements

Consider the data given in Table 5.8 of replicate measurements of y
taken at different values of x (which vary slightly).

A scatter plot of this data and the simple OLS linear model are
shown in Fig. 5.22a. The regressed model is:

$y = -0.578954 + 1.1354\,x$ with $R^2 = 0.841$ and RMSE = 1.4566   (5.49a)

Note that the intercept term in the model is not statistically
significant (p-value = 0.4 for the t-statistic), while the overall
model fit given by the F-ratio is significant. The model residuals
of a simple OLS fit are shown in Fig. 5.22b.



The residuals of a simple linear OLS model shown in Fig. 5.22b
reveal, as expected, marked heteroscedasticity. Hence, the OLS model
is bound to lead to misleading uncertainty bands even if the model
predictions themselves are not biased. The model residuals from the
above model are also shown in Table 5.8. Subsequently, the mean
$\bar{x}$ and the mean square error $s_i^2$ of each group of
replicates are calculated following Eq. 5.48 to yield the following
table:



Mean x    s_i^2
3         0.0072
5.39      0.373
7.84      1.6482
9.15      0.8802
10.22     4.1152





Coefficients

Parameter   Least squares estimate   Standard error   t-statistic   P-value
Intercept   -0.578954                0.679186         -0.852423     0.4001
Slope        1.1354                  0.086218         13.169        0.0000

Analysis of Variance

Source          Sum of squares   Df   Mean square   F-ratio   P-value
Model           367.948          1    367.948       173.42    0.0000
Residual        70.0157          33   2.12169
Total (Corr.)   437.964          34






* From Draper and Smith (1981) by permission of John Wiley and 
Sons. 



Then, because of the pattern exhibited, a second order polynomial
OLS model is regressed to this data (see Fig. 5.22c):

$s_i^2 = 1.887 - 0.8727\,\bar{x} + 0.09967\,\bar{x}^2$ with $R^2 = 0.743$   (5.49b)

The regression weights $w_i$ can thus be deduced by using individual
values of x instead of $\bar{x}$ in the above equation and taking
the reciprocal. The values of the weights are also shown in
Table 5.8. Finally, a weighted regression is performed following
Eq. 5.46 (most statistical packages have this capability) resulting
in:

$y = -0.942228 + 1.16252\,x$ with $R^2 = 0.896$ and RMSE = 1.2725   (5.49c)



Fig. 5.22 a Data set and OLS regression line of observations with
non-constant variance and replicated observations in x. b Residuals
of a simple linear OLS model fit (Eq. 5.49a). c Residuals of a
second order polynomial OLS fit to the mean x and mean square error
(MSE) of the replicate values (Eq. 5.49b). d Residuals of the
weighted regression model (Eq. 5.49c)

The residuals are shown in Fig. 5.22d. Though the goodness of fit is
only slightly better than that of the OLS model, the real advantage
is that this model will have better prediction accuracy and
realistic prediction errors. ■
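The two-stage procedure of this example can be sketched as follows, assuming NumPy. The data are those of Table 5.8 (left half followed by right half). The assignment of points to replicate groups is an assumption inferred from the tabulated group means (the first two points are left ungrouped), and the pure-error variance of each group is computed here from the OLS residuals, which reproduces the tabulated $s_i^2$ values.

```python
import numpy as np

# Table 5.8 data (left half of the table followed by the right half)
x = np.array([1.15, 1.90, 3.00, 3.00, 3.00, 3.00, 3.00,
              5.34, 5.38, 5.40, 5.40, 5.45,
              7.70, 7.80, 7.81, 7.85, 7.87, 7.91, 7.94,
              9.03, 9.07, 9.11, 9.14, 9.16, 9.37,
              10.17, 10.18, 10.22, 10.22, 10.22, 10.18, 10.50, 10.23,
              10.03, 10.23])
y = np.array([0.99, 0.98, 2.60, 2.67, 2.66, 2.78, 2.80,
              5.92, 5.35, 4.33, 4.89, 5.21,
              7.68, 9.81, 6.52, 9.71, 9.82, 9.81, 8.50,
              9.47, 11.45, 12.14, 11.50, 10.65, 10.64,
              9.78, 12.39, 11.03, 8.00, 11.90, 8.68, 7.25, 13.46,
              10.19, 9.93])

# Stage 1: ordinary least squares and its residuals (Eq. 5.49a)
X = np.column_stack([np.ones_like(x), x])
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b_ols

# Assumed replicate groups (index ranges into the arrays above)
groups = [slice(2, 7), slice(7, 12), slice(12, 19),
          slice(19, 25), slice(25, 35)]
x_bar = np.array([x[g].mean() for g in groups])
s2 = np.array([resid[g].var(ddof=1) for g in groups])   # Eq. 5.48

# Smooth the variance with a quadratic in mean x (Eq. 5.49b)
c2, c1, c0 = np.polyfit(x_bar, s2, 2)
w = 1.0 / (c0 + c1 * x + c2 * x**2)      # weights at individual x values

# Stage 2: weighted least squares (Eq. 5.46), rows scaled by sqrt(w)
sw = np.sqrt(w)
b_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

print("OLS :", b_ols)
print("s_i^2 model coefficients:", c0, c1, c2)
print("WLS :", b_wls)
```

One design caveat: a fitted quadratic variance function can dip below zero between groups, which would give meaningless weights, so it is prudent to confirm that all weights are positive at the observed x values before running the second stage.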



(c-iii) Non-patterned Variance in the Residuals A third type of
non-constant residual variance is one where no pattern is discerned
with respect to the regressors, which can be discrete or vary
continuously. In this case, a practical approach is to look at a
plot of the model residuals against the response variable, divide
the range in the response variable into as many regions as seem to
have different variances, and calculate the standard deviation of
the residuals for each of these regions. In that sense, the general
approach parallels the one adopted in case (c-ii) when dealing with
replicated values with non-constant variance; however, now, no model
such as Eq. 5.49b is needed. The general approach would involve the
following steps:

• First, fit an OLS model to the data;

• Next, discretize the domain of the regressor variables into a
finite number of groups and determine $\varepsilon_i^2$ from which
the weights $w_i$ for each of these groups can be deduced;

• Finally, perform a WLS regression in order to estimate the
efficient model parameters.

Though this two-stage estimation approach is conceptually easy and
appealing for simple models, it may become rather complex for
multivariate models, and moreover, there is no guarantee that
heteroscedasticity will be removed entirely.
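The three steps listed above can be sketched as follows, assuming NumPy. The data are synthetic, and the group boundaries at x = 4 and x = 7, the noise levels and the seed are illustrative assumptions; in practice the boundaries would be chosen by inspecting the residual plot. Each group receives a single weight equal to the reciprocal of its residual variance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data whose scatter changes between regions without a trend
n = 200
x = np.sort(rng.uniform(0.0, 10.0, n))
sigma = np.where(x < 4.0, 0.5, np.where(x < 7.0, 2.0, 1.0))
y = 3.0 + 2.0 * x + rng.normal(0.0, sigma)

# Step 1: OLS fit and residuals
X = np.column_stack([np.ones(n), x])
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b_ols

# Step 2: discretize the regressor domain and estimate one residual
# variance per group (boundaries are an assumption of this sketch)
edges = np.array([4.0, 7.0])
group = np.digitize(x, edges)                    # group index 0, 1 or 2
s2 = np.array([resid[group == g].var(ddof=1) for g in range(3)])
w = 1.0 / s2[group]                              # weight of each point

# Step 3: WLS fit with rows scaled by sqrt(w)
sw = np.sqrt(w)
b_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

print("group variances:", s2)
print("OLS:", b_ols, " WLS:", b_wls)
```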



5.6.4 Serially Correlated Residuals 

Another manifestation of improper residual behavior is serial
correlation (discussed in Sect. 5.6.1). As stated earlier, one
should distinguish between the two different types of
autocorrelation, namely pure autocorrelation and
model-misspecification, though often it is difficult to discern
between them. The latter is usually addressed using the weight
matrix approach (Pindyck and Rubinfeld 1981) which is fairly formal
and general, but somewhat demanding. Pure autocorrelation relates to
the case of "pseudo" patterned residual behavior which arises
because the regressor variables have strong serial correlation. This
serial correlation behavior is subsequently transferred over to the
model, and thence to its residuals, even when the regression model
functional form is close to "perfect". The remedial approach to be
adopted is to transform the original data set prior to regression
itself. There are several techniques of doing so, and the widely
used Cochrane-Orcutt (CO) procedure is described here. It involves
the use of generalized differencing to alter the linear model into
one in which the errors are independent. The two-stage first-order
CO procedure involves:
(i) fitting an OLS model to the original variables;

(ii) computing the first-order serial correlation coefficient
$\rho$ of the model residuals;

(iii) transforming the original variables y and x into a new set of
pseudo-variables:

$y_t^* = y_t - \rho\, y_{t-1}$   and   $x_t^* = x_t - \rho\, x_{t-1}$   (5.50)

(iv) OLS regressing of the pseudo-variables $y^*$ and $x^*$ to
re-estimate the parameters of the model;

(v) finally, obtaining the fitted regression model in the original
variables by a back transformation of the pseudo regression
coefficients:

$b_0 = b_0^*/(1 - \rho)$   and   $b_1 = b_1^*$   (5.51)

Though two estimation steps are involved, the entire process is
simple to implement. This approach, when originally proposed,
advocated that the process be continued till the residuals become
random (say, based on the Durbin-Watson test). However, the current
recommendation is that alternative estimation methods should be
attempted if one iteration proves inadequate. This approach can be
used during parameter estimation of MLR models provided only one of
the regressor variables is the cause of the pseudo-correlation.
Also, a more sophisticated version of the CO method has been
suggested by Hildreth and Lu (Chatterjee and Price 1991) involving
only one estimation process where the optimal value of $\rho$ is
determined along with the parameters. This, however, requires
non-linear estimation methods.
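Steps (i) to (v) of the first-order CO procedure can be sketched as follows, assuming NumPy. The function names, the synthetic AR(1) data and the seed are assumptions made for illustration only, and the lag-one coefficient $\rho$ is estimated here by regressing the residuals on their lagged values, which is one common choice among several.

```python
import numpy as np

def ols(x, y):
    """Simple linear OLS fit; returns (intercept, slope) and residuals."""
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b, y - X @ b

def durbin_watson(e):
    """Durbin-Watson statistic; values near 2 suggest no autocorrelation."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def cochrane_orcutt(x, y):
    b_ols, e = ols(x, y)                                  # step (i)
    rho = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)    # step (ii)
    y_star = y[1:] - rho * y[:-1]                         # step (iii), Eq. 5.50
    x_star = x[1:] - rho * x[:-1]
    b_star, e_star = ols(x_star, y_star)                  # step (iv)
    b0 = b_star[0] / (1.0 - rho)                          # step (v), Eq. 5.51
    b1 = b_star[1]
    return (b0, b1), rho, durbin_watson(e), durbin_watson(e_star)

# Synthetic daily data: serially correlated regressor and AR(1) errors
rng = np.random.default_rng(42)
n = 365
x = np.empty(n)
err = np.empty(n)
x[0], err[0] = rng.normal(), rng.normal()
for t in range(1, n):
    x[t] = 0.9 * x[t - 1] + rng.normal()
    err[t] = 0.7 * err[t - 1] + rng.normal()
y = 5.0 + 2.0 * x + err

(b0, b1), rho, dw_ols, dw_co = cochrane_orcutt(x, y)
print(f"rho = {rho:.2f}, DW (OLS) = {dw_ols:.2f}, DW (CO) = {dw_co:.2f}")
print(f"back-transformed model: y = {b0:.3f} + {b1:.3f} x")
```

The Durbin-Watson statistic of the transformed residuals should move toward 2, which is the signal that one CO iteration has been adequate.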

Example 5.6.4: Using the Cochrane-Orcutt procedure to remove
first-order autocorrelation

Consider the case when observed pre-retrofit data of energy
consumption in a commercial building support a linear regression
model as follows:

$E_i = a_0 + a_1 T_i$   (5.52)

where

$T_i$ = daily average outdoor dry-bulb temperature,

$E_i$ = daily total energy use predicted by the model,

i = subscript representing a particular day, and

$a_0$ and $a_1$ are the least-square regression coefficients.

How the above transformation yields a regression model different
from OLS estimation is illustrated in Fig. 5.23 with year-long daily
cooling energy use from a large institutional building in central
Texas. The first-order autocorrelation coefficients of cooling
energy and average daily temperature were both equal to 0.92, while
that of the OLS residuals was 0.60. The Durbin-Watson statistic for
the OLS residuals (i.e., untransformed data) was DW=3 indicating
strong residual autocorrelation, while that of the CO transform was
1.89 indicating little or no autocorrelation. Note that the CO
transform is inadequate in cases of model mis-specification and/or
seasonal operational changes. ■



5.6.5 Dealing with Misspecified Models 

An important source of error during model identification is model
misspecification error. This is unrelated to measurement error, and
arises when the functional form of the model is not appropriate.
This can occur due to:

(i) inclusion of irrelevant variables: This does not bias the
estimation of the intercept and slope parameters, but generally
reduces the efficiency of the slope parameters, i.e., their standard
errors will be larger. This source of error can be eliminated by,
say, step-wise regression or simple tests such as t-tests;

(ii) exclusion of an important variable: This case will result in
the slope parameters being both biased and inconsistent;

(iii) assumption of a linear model: This arises when a linear model
is erroneously assumed; and

(iv) incorrect model order: This corresponds to the case when one
assumes a lower or higher model order than what the data warrants.

The latter three sources of errors are very likely to manifest
themselves in improper residual behavior (the residuals will show
sequential or non-constant variance behavior). The residual analysis
may not identify the exact cause, and several attempts at model
reformulation may be required to overcome this problem. Even if the
physics of the phenomenon or of the system is well understood and
can be cast in mathematical terms, experimental or identifiability
constraints may require that a simplified or macroscopic model be
used for parameter identification rather than the detailed model.
This could cause model misspecification, especially so if the model
is poor.

Fig. 5.23 How serial correlation in the residuals affects model
identification (Example 5.6.4)

Example 5.6.5: Example to illustrate how inclusion of additional
regressors can remedy improper model residual behavior

Energy use in commercial buildings accounts for about 18% of the
total energy use in the United States and consequently, it is a
prime area of energy conservation efforts. For this purpose, the
development of baseline models, i.e., models of energy use for a
specific end-use before energy conservation measures are
implemented, is an important modeling activity for monitoring and
verification studies.

Let us illustrate the effect of improper selection of regressor
variables or model misspecification for modeling measured thermal
cooling energy use of a large commercial building operating 24 hours
a day under a variable air volume HVAC system (Katipamula et
al. 1998). Figure 5.24 illustrates the residual pattern when hourly
energy use is modeled with only the outdoor dry-bulb temperature
(T). The residual pattern is blatantly poor, exhibiting both
non-constant variance as well as systematic bias in the low range of
the x-variable. Once the outdoor dew point temperature
($T_{dp}^{+}$)†, the global horizontal solar radiation ($q_{sol}$)
and the internal building heat loads $q_i$ (such as lights and
equipment) are introduced in the model, the residual behavior
improves significantly but the lower tail is still present. Finally,
when additional terms involving indicator variables I applied to
both the intercept and T are introduced (described in Sect. 5.7.2),
an acceptable residual behavior is achieved. ■

Fig. 5.24 Improvement in residual behavior for a model of hourly
energy use of a variable air volume HVAC system in a commercial
building as influential regressors are incrementally added to the
model. (From Katipamula et al. 1998)

† Actually the outdoor humidity impacts energy use only when the dew
point temperature exceeds a certain threshold which many studies
have identified to be about 55°F (this is related to how the HVAC is
controlled in response to human comfort). This type of conditional
variable is indicated by a + superscript.



5.7 Other OLS Parameter Estimation Methods 

5.7.1 Zero-Intercept Models 

Sometimes the physics of the system dictates that the regression
line pass through the origin. For the linear case, the model assumes
the form:

$y = \beta_1 x$   (5.53)

The interpretation of $R^2$ in such a case is not the same as for
the model with an intercept, and this statistic cannot be used to
compare the two types of models directly. Recall that the $R^2$
value designated the percentage variation of the response variable
about its mean explained by that of the regressor variable. For the
no-intercept case, the $R^2$ value explains the percentage variation
of the response variable about the origin explained by that of the
regressor variable. Thus, when comparing both models, one should
decide on which is the better model based on their RMSE values.
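The difference between the two $R^2$ definitions is easy to see numerically. The sketch below, assuming NumPy, uses synthetic data (the true slope, noise level and seed are illustrative assumptions). It fits both the zero-intercept and the with-intercept models, computes $R^2$ about the origin for the former and about the mean for the latter, and reports the RMSE of each, which is the statistic recommended above for the comparison.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data generated from a line through the origin
x = np.linspace(1.0, 10.0, 30)
y = 2.5 * x + rng.normal(0.0, 1.0, x.size)
n = x.size

# Zero-intercept model y = b1*x: the OLS slope has a closed form
b1_zero = np.sum(x * y) / np.sum(x ** 2)
resid_zero = y - b1_zero * x
r2_origin = 1.0 - np.sum(resid_zero ** 2) / np.sum(y ** 2)   # about the origin
rmse_zero = np.sqrt(np.sum(resid_zero ** 2) / (n - 1))       # one parameter

# Model with intercept y = b0 + b1*x
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_int = y - X @ b
r2_mean = 1.0 - np.sum(resid_int ** 2) / np.sum((y - y.mean()) ** 2)
rmse_int = np.sqrt(np.sum(resid_int ** 2) / (n - 2))         # two parameters

print(f"zero-intercept: slope = {b1_zero:.3f}, "
      f"R2 (about origin) = {r2_origin:.4f}, RMSE = {rmse_zero:.3f}")
print(f"with intercept: b0 = {b[0]:.3f}, b1 = {b[1]:.3f}, "
      f"R2 (about mean) = {r2_mean:.4f}, RMSE = {rmse_int:.3f}")
```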



5.7.2 Indicator Variables for Local Piecewise Models - Spline Fits

Spline functions are an important class of functions, described in
numerical analysis textbooks in the framework of interpolation,
which allow distinct functions to be used over different ranges
while maintaining continuity in the function. They are extremely
flexible functions in that they allow a wide range of locally
different behavior to be captured within one elegant functional
framework. Thus, a globally non-linear function can be decomposed
into simpler local patterns. Two cases arise.

(a) The simpler case is one where it is known which points lie on
which trend, i.e., when the physics of the system is such that the
location of the structural break or "hinge point" $x_c$ of the
regressor is known. The simplest type is the piece-wise linear
spline (as shown in Fig. 5.25), with higher order polynomial splines
up to the third degree also being used often to capture non-linear
trends. The objective here is to formulate a linear model and
identify its parameters which best describe the data points in
Fig. 5.25. One cannot simply divide the data into two, and fit each
region with a separate linear model, since the constraint that the
model be continuous at the hinge point would be violated. A model of
the following form would be acceptable:

$y = \beta_0 + \beta_1 x + \beta_2 (x - x_c) I$   (5.54a)

where the indicator variable

$I = 1$ if $x > x_c$, and $I = 0$ otherwise   (5.54b)

Hence, for the region $x \le x_c$, the model is:

$y = \beta_0 + \beta_1 x$   (5.55)

and for the region $x > x_c$:
$y = (\beta_0 - \beta_2 x_c) + (\beta_1 + \beta_2) x$.
Thus, the slope of the model is $\beta_1$ before the break and
$(\beta_1 + \beta_2)$ afterwards. The intercept term changes as well
from $\beta_0$ before the break to $(\beta_0 - \beta_2 x_c)$ after
the break. The logical extensions to linear spline models with two
structural breaks or to higher order splines involving quadratic and
cubic terms are fairly straightforward.

Fig. 5.25 Piece-wise linear model or first-order spline fit with
hinge point at $x_c$ (horizontal axis: regressor variable). Such
models are referred to as change point models in building energy
modeling terminology

(b) The second case arises when the change point is not known. A
simple approach is to look at the data, identify a "ball-park" range
for the change point, perform numerous regression fits with the data
set divided according to each possible value of the change point in
this ball-park range, and pick that value which yields the best
overall R-square or RMSE. Alternatively, the more accurate but more
complex approach is to cast the problem as a nonlinear estimation
problem with the change point variable as one of the parameters.

Example 5.7.1: Change point models for building utility bill
analysis

The theoretical basis of modeling monthly energy use in buildings is
discussed in several papers (for example, Reddy et al. 1997). The
interest in this particular time scale is obvious: such information
is easily obtained from utility bills which are usually on a monthly
time scale. The models suitable for this application are similar to
linear spline models, and are referred to as change point models by
building energy analysts. A simple example is shown below to
illustrate the above equations. Electricity utility bills of a
residence in Houston, TX have been normalized by the number of days
in the month and assembled in Table 5.9 along with the corresponding
month and monthly mean outdoor temperature values for Houston (the
first three columns of the table). The intent is to use Eq. 5.54 to
model this behavior.

The scatter plot and the trend lines drawn in Fig. 5.26 suggest that
the change point is in the range 17-19°C. Let us perform the
calculation assuming a value of 17°C. Defining an indicator
variable:

$I = 1$ if $x > 17$°C, and $I = 0$ otherwise

Table 5.9 Measured monthly energy use data and calculation step for
deducing the change point independent variable assuming a base value
of 17°C

Month   Mean outdoor       Monthly mean daily     x      (x - 17°C)I
        temperature (°C)   electric use           (°C)   (°C)
                           (kWh/m²/day)
Jan     11                 0.1669                 11     0
Feb     13                 0.1866                 13     0
Mar     16                 0.1988                 16     0
Apr     21                 0.2575                 21     4
May     24                 0.3152                 24     7
Jun     27                 0.3518                 27     10
Jul     29                 0.3898                 29     12
Aug     29                 0.3872                 29     12
Sept    26                 0.3315                 26     9
Oct     22                 0.2789                 22     5
Nov     16                 0.2051                 16     0
Dec     13                 0.1790                 13     0

Based on this assumption, the last two columns of the table have
been generated to correspond to the two regressor variables in
Eq. 5.54. A linear multiple regression yields:

$y = 0.1046 + 0.005904\,x + 0.00905\,(x - 17)I$
with $R^2 = 0.996$ and RMSE = 0.0055

with all three parameters being significant. The reader can repeat
this analysis assuming a different value for the change point (say
$x_c$ = 18°C) in order to study the sensitivity of the model to the
choice of the change point value. Though only three parameters are
determined by regression, this is an example of a four parameter (or
4-P) model in building science terminology. The fourth parameter is
the change point $x_c$ which also needs to be determined. Software
programs have been developed to determine the optimal value of $x_c$
(i.e., that which results in minimum RMSE among the different
possible choices of $x_c$) following a numerical search process akin
to the one described in this example. ■

Fig. 5.26 Piece-wise linear regression lines for building electric
use with outdoor temperature. The change point is the point of
intersection of the two lines. The combined model is called a change
point model, which, in this case, is a four parameter model given by
Eq. 5.54
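The calculation of this example, together with the simple numerical search over candidate change points described in case (b) of Sect. 5.7.2, can be sketched as follows. NumPy is assumed; the function name `fit_change_point` and the candidate grid from 16 to 20.5°C in 0.5°C steps are assumptions chosen for illustration.

```python
import numpy as np

# Table 5.9: monthly mean outdoor temperature (deg C) and
# monthly mean daily electric use (kWh/m2/day), Jan to Dec
temp = np.array([11, 13, 16, 21, 24, 27, 29, 29, 26, 22, 16, 13], dtype=float)
use = np.array([0.1669, 0.1866, 0.1988, 0.2575, 0.3152, 0.3518,
                0.3898, 0.3872, 0.3315, 0.2789, 0.2051, 0.1790])

def fit_change_point(x, y, xc):
    """Fit y = b0 + b1*x + b2*(x - xc)*I with I = 1 if x > xc (Eq. 5.54)."""
    hinge = np.where(x > xc, x - xc, 0.0)        # the (x - xc)*I regressor
    X = np.column_stack([np.ones_like(x), x, hinge])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rmse = np.sqrt(resid @ resid / (len(x) - X.shape[1]))
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, rmse, r2

# Fit with the change point fixed at 17 deg C, as in the example
beta17, rmse17, r2_17 = fit_change_point(temp, use, 17.0)
print("xc = 17:", beta17, "RMSE =", rmse17, "R2 =", r2_17)

# Simple grid search over an assumed ball-park range of change points
candidates = np.arange(16.0, 21.0, 0.5)
rmses = [fit_change_point(temp, use, xc)[1] for xc in candidates]
best_xc = candidates[int(np.argmin(rmses))]
print("change point with minimum RMSE on the grid:", best_xc)
```

With monthly data there are few points near the hinge, so the RMSE can be fairly flat over the candidate range; the grid result is best read as an indication of sensitivity rather than as a sharp estimate.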



5.7.3 Indicator Variables for Categorical 
Regressor Models 

The use of indicator (also called dummy) variables has been 
illustrated in the previous section when dealing with spline 
models. They are also used in cases when shifts in either the 
intercept or the slope are to be modeled with the condition 
of continuity now being relaxed. The majority of variables 
encountered in mechanistic models are quantitative, i.e., the 
variables are measured on a numerical scale. Some examples 
are temperature, pressure, distance, energy use and age. Oc- 
casionally, the analyst comes across models involving quali- 
tative variables, i.e., regressor data that belong in one of two 
(or more) possible categories. One would like to evaluate 
whether differences in intercept and slope between catego- 
ries are significant enough to warrant two separate models or 
not. This concept is illustrated by the following example. 

Suppose one wishes to determine whether the annual energy use of regular commercial buildings is markedly higher than that of buildings certified as being energy efficient. Data from several buildings which fall in each group are gathered to ascertain whether the presumption is supported by the actual data. Factors which affect the normalized energy use (variable $y$) of both experimental groups are conditioned floor area (variable $x_1$) and outdoor temperature (variable $x_2$). Suppose that a linear relationship can be assumed with the same intercept for both groups. One approach would be to separate the data into two groups: one for regular buildings and one for efficient buildings, and develop regression models for each group separately. Subsequently, one could perform a t-test to determine whether the slope terms of the two models are significantly different or not. However, the assumption of a constant intercept term for both models may be erroneous, and this may confound the analysis. A better approach is to use the entire data and adopt a modeling approach involving indicator variables.

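Let model 1 be for regular buildings:

$$y = a + b_1 x_1 + c_1 x_2 \tag{5.55}$$

and model 2 be for energy efficient buildings:

$$y = a + b_2 x_1 + c_2 x_2 \tag{5.56}$$

The complete model (or model 3) would be formulated as:

$$y = a + b_1 x_1 + c_1 x_2 + b_2 (I \cdot x_1) + c_2 (I \cdot x_2) \tag{5.57}$$

where $I$ is an indicator variable such that

$$I = \begin{cases} 1 & \text{for energy efficient buildings} \\ 0 & \text{for regular buildings} \end{cases}$$

In Eq. 5.57, the coefficients $b_2$ and $c_2$ represent the shifts in the two slopes of the energy efficient group relative to those of the regular group.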
Note that a basic assumption in formulating this model is that the intercept is unaffected by the building group. Formally, one would like to test the null hypothesis $H_0: b_2 = c_2 = 0$. The hypothesis is tested by constructing an F statistic for the comparison of the two models. Note that model 3 is referred to as the full model (FM) or as the pooled model. Model 1, when the null hypothesis holds, is the reduced model (RM). The idea is to compare the goodness-of-fit of the FM and that of the RM. If the RM provides as good a fit as the FM, then the null hypothesis cannot be rejected. Let SSE(FM) and SSE(RM) be the corresponding model sum of square errors or squared model residuals. Then, the following F-test statistic is defined:



$$F = \frac{[\mathrm{SSE(RM)} - \mathrm{SSE(FM)}]/(k - m)}{\mathrm{SSE(FM)}/(n - k)} \tag{5.58}$$



where n is the number of observations (data points), k is the number of parameters of the FM, and m the number of parameters of the RM. If the observed F value is larger than the tabulated value of F with (k - m) and (n - k) degrees of freedom (numerator and denominator respectively) at the prespecified significance level (provided by Table A.6), the RM is unsatisfactory and the full model has to be retained. As a cautionary note, this test is strictly valid only if the OLS assumptions for the model residuals hold.
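As a minimal sketch, Eq. 5.58 can be coded as a small function. The function name `partial_f` is hypothetical; the numerical inputs are the sums of squares and sample size reported in Example 5.7.2 (n = 20 observations, k = 3 parameters in the FM, m = 2 in the RM).

```python
def partial_f(sse_rm, sse_fm, n, k, m):
    """F statistic of Eq. 5.58 comparing a reduced model (RM) with a full model (FM)."""
    if not n > k > m:
        raise ValueError("expected n > k > m")
    return ((sse_rm - sse_fm) / (k - m)) / (sse_fm / (n - k))


# Values reported in Example 5.7.2
f_stat = partial_f(sse_rm=889.245, sse_fm=7.7943, n=20, k=3, m=2)
print(round(f_stat, 1))  # 1922.5
```

The computed value is then compared against the tabulated F value with (k - m) and (n - k) degrees of freedom at the chosen significance level.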

Example 5.7.2: Combined modeling of energy use in regu- 
lar and energy efficient buildings 

Consider the data assembled in Table 5.10. Let us desig- 
nate the regular buildings by group (A) and the energy effi- 
cient buildings by group (B), with the problem simplified by 
assuming both types of buildings to be located in the same 
geographic location. Hence, the model has only one regres- 
sor variable involving floor area. The complete model with 
the indicator variable term given by Eq. 5.57 is used to verify 
whether group B buildings consume less energy than group 
A buildings. 

The full model (FM) given by Eq. 5.57 reduces to the following form since only one regressor is involved: $y = a + b_1 x_1 + b_2 (I \cdot x_1)$, where the variable $I$ is an indicator variable such that it is 0 for group A and 1 for group B. The null hypothesis is $H_0: b_2 = 0$. The reduced model (RM) is $y = a + b_1 x_1$.

Table 5.10 Data table for Example 5.7.2

Energy use (y)   Floor area (x1)   Bldg type      Energy use (y)   Floor area (x1)   Bldg type
45.44            225               A              32.13            224               B
42.03            200               A              35.47            251               B
50.10            250               A              33.49            232               B
48.75            245               A              32.29            216               B
47.92            235               A              33.50            224               B
47.79            237               A              31.23            212               B
52.26            265               A              37.52            248               B
50.52            259               A              37.13            260               B
45.58            221               A              34.70            243               B
44.78            218               A              33.92            238               B

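The estimated model is $y = 14.2762 + 0.14115\,x_1 - 13.2802\,I$. Note that these estimates correspond to the indicator variable entering as an additive term, i.e., as a shift in intercept between the two groups: the last coefficient is close to the difference of about -13.4 between the mean energy use of group B (34.14) and that of group A (47.52), whereas a coefficient of this magnitude multiplying the product $(I \cdot x_1)$, with floor areas between 200 and 265, would yield meaningless predictions. The analysis of variance shows that SSE(FM) = 7.7943 and SSE(RM) = 889.245. The F statistic in this case is:

$$F = \frac{(889.245 - 7.7943)/1}{7.7943/(20 - 3)} = 1922.5$$

One can thus safely reject the null hypothesis, and state with confidence that buildings built as energy-efficient ones consume energy which is statistically lower than those which are not.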

It is also possible to extend the analysis and test whether both slope and intercept are affected by the type of building. The FM in this case is $y = a_1 + b_1 x_1 + c\,I + d\,(I \cdot x_1)$ where $I$ is an indicator variable which is, say, 0 for Building A and 1 for Building B. The null hypothesis in this case is that $c = d = 0$. This is left for the interested reader to solve. ■
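The fits of this example can be checked numerically with the sketch below, which uses the 20 data points of the data table for Example 5.7.2. The helper names (`solve`, `ols`, `partial_f`) are hypothetical, and the normal equations are solved directly, which is adequate for a problem of this size. Two three-parameter full models are fitted for comparison: one where the indicator enters additively (shift in intercept) and one where it multiplies the floor area (shift in slope).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """Return (coefficients, SSE) of the least-squares fit of y on the columns of X."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(X, y))
    return beta, sse


def partial_f(sse_rm, sse_fm, n, k, m):
    """F statistic of Eq. 5.58."""
    return ((sse_rm - sse_fm) / (k - m)) / (sse_fm / (n - k))


# Group A (regular) and group B (energy efficient) buildings
area_a = [225, 200, 250, 245, 235, 237, 265, 259, 221, 218]
use_a = [45.44, 42.03, 50.10, 48.75, 47.92, 47.79, 52.26, 50.52, 45.58, 44.78]
area_b = [224, 251, 232, 216, 224, 212, 248, 260, 243, 238]
use_b = [32.13, 35.47, 33.49, 32.29, 33.50, 31.23, 37.52, 37.13, 34.70, 33.92]

x = [float(v) for v in area_a + area_b]
y = use_a + use_b
ind = [0.0] * len(area_a) + [1.0] * len(area_b)  # indicator: 0 for A, 1 for B
n = len(y)

_, sse_rm = ols([[1.0, xi] for xi in x], y)                                 # y = a + b1*x1
beta_shift, sse_shift = ols([[1.0, xi, i] for xi, i in zip(x, ind)], y)       # ... + c*I
beta_slope, sse_slope = ols([[1.0, xi, i * xi] for xi, i in zip(x, ind)], y)  # ... + b2*(I*x1)

f_shift = partial_f(sse_rm, sse_shift, n, 3, 2)
f_slope = partial_f(sse_rm, sse_slope, n, 3, 2)
print("intercept shift:", beta_shift, sse_shift, f_shift)
print("slope shift:    ", beta_slope, sse_slope, f_slope)
```

The additive-indicator fit reproduces the coefficients and sums of squares quoted in the example, while the slope-shift fit gives a much smaller coefficient on the product term. The four-parameter model with both shifts can be fitted the same way by adding both columns.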



5.7.4 Assuring Model Parsimony — Stepwise 
Regression 

Perhaps the major problem with multivariate regression is that the "independent" variables are not really independent but collinear to some extent (how to deal with collinear regressors by transformation is discussed in Sect. 9.3). In multivariate regression, a rule of thumb is that the number of observations should be at least four times the number of variables (Chatfield 1995). Hence, with n = 12, the number of variables should be at most 3. Moreover, some authors go so far as to state that multivariate regression models with more than 4-5 variables are suspect. There is, thus, a big benefit in identifying models that are parsimonious. The more straightforward approach is to use the simpler (but formal) methods to identify/construct the "best" model linear in the parameters if the comprehensive set of all feasible/possible regressors of the model is known (Draper and Smith 1981; Chatterjee and Price 1991):

(a) All possible regression models: This method involves: (i) constructing models of different basic forms (single variate with various degrees of polynomials and multi-variate), (ii) estimating parameters that correspond to all possible predictor variable combinations, and (iii) then selecting the one considered most desirable based on some criterion. While this approach is thorough, the computational effort involved may be significant. For example, with p possible parameters, the number of model combinations would be $2^p$. However, this may be moot if the statistical analysis program being used contains such a capability. The only real drawback is that blind curve fitting may suggest a model with no physical justification, which in certain applications may have undesirable consequences. Further, it is advised that a cross-validation scheme be used to avoid overfitting (see Sect. 5.3.2-d).

In any case, one needs a statistical criterion to determine, if not the "best" model, then at least a subset of desirable models from which one can be chosen based on the physics of the problem. One could use the adjusted R-square given by Eq. 5.7b which includes the number of model parameters. Another criterion for model selection is the Mallows $C_p$ statistic which gives a normalized estimate of the total expected estimation error for all observations in the data set and takes account of both bias and variance:

$$C_p = \frac{\mathrm{SSE}}{\hat{\sigma}^2} + (2p - n) \tag{5.59}$$

where SSE is the sum of square errors (see Eq. 5.2), $\hat{\sigma}^2$ is the variance of the residuals with the full set of variables, n is the number of data points, and p is the number of parameters in the specific model. It can be shown that the expected value of $C_p$ is p when there is no bias in the fitted equation containing p terms. Thus "good" or desirable model possibilities are those whose $C_p$ values are close to the corresponding number of parameters of the model.
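A minimal sketch of Eq. 5.59 follows. The function name `mallows_cp` and all numerical values are hypothetical, and the residual variance of the full model is assumed to be estimated as SSE(full)/(n - p_full).

```python
def mallows_cp(sse_p, sigma2_full, n, p):
    """Mallows C_p (Eq. 5.59) for a candidate model with p parameters."""
    return sse_p / sigma2_full + (2 * p - n)


# Hypothetical case: n = 20 points, full model with 5 parameters and SSE = 30
n = 20
sigma2_full = 30.0 / (n - 5)            # residual variance of the full model = 2.0

cp_full = mallows_cp(30.0, sigma2_full, n, 5)       # equals p for the full model
cp_candidate = mallows_cp(36.0, sigma2_full, n, 3)  # 3-parameter candidate with SSE = 36
print(cp_full, cp_candidate)  # 5.0 4.0
```

With this variance estimate the full model always yields a value equal to its own number of parameters, so the statistic is informative only for comparing the reduced candidates. Here the three-parameter candidate gives a value of 4, close to p = 3, suggesting little bias.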

Another automatic selection approach to handling models with a large number of possible parameters is the iterative approach which comes in three variants.

(b-1) Backward Elimination Method: One begins by selecting an initial model that includes the full set of possible predictor variables from the candidate pool, and then successively drops one variable at a time on the basis of its contribution to the reduction of SSE. The OLS method is used to estimate all model parameters along with t-values for each model parameter. If all model parameters are statistically significant, the model building process stops. If some model parameters are not significant, the model parameter of least significance (lowest absolute t-value) is omitted from the regression equation, and the reduced model is refit. This process continues until all parameters that remain in the model are statistically significant.

(b-2) Forward Selection Method: One begins with an equation containing no regressors (i.e., a constant model). The model is then augmented by including the regressor variable with the highest simple correlation with the response variable. If this regression coefficient is significantly different from zero, it is retained, and the search for a second variable is made. This process of adding regressors one-by-one is terminated when the last variable entering the equation has an insignificant regression coefficient or when all the variables are included in the model. Clearly, this approach involves fitting many more models than in the backward elimination method.

Footnote on the "best" model: Actually, there is no "best" model since random variables are involved. A better term would be "most plausible" and should include mechanistic considerations, if appropriate.

(b-3) Stepwise Regression Method: This is one of the more powerful model building approaches and combines both the above procedures. Stepwise regression begins by computing correlation coefficients between the response and each predictor variable. The variable most highly correlated with the response is then allowed to "enter the regression equation". The parameter for the single-variable regression equation is then estimated along with a measure of the goodness of fit. The next most highly correlated predictor variable is identified, given the current variable already in the regression equation. This variable is then allowed to enter the equation and the parameters re-estimated along with the goodness of fit. Following each parameter estimation, t-values for each parameter are calculated and compared to t-critical to determine whether all parameters are still statistically significant. Any parameter that is not statistically significant is removed from the regression equation. This process continues until no more variables "enter" or "leave" the regression equation. In general, it is best to select the model that yields a reasonably high "goodness of fit" for the fewest parameters in the model (referred to as model parsimony). The final decision on model selection requires the judgment of the model builder, and relies on mechanistic insights into the problem. Again, one has to guard against the danger of overfitting by performing a cross-validation check.
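Of the three iterative variants, the backward elimination method (b-1) is the simplest to sketch in code. The example below is an illustration only: the function names, the synthetic data (generated with a fixed random seed), and the threshold t_crit = 2.0 (a rough stand-in for the critical t-value at about the 95% level) are all assumptions, and the intercept is always retained.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_with_t(X, y):
    """OLS fit returning the coefficients and their t-values."""
    n, p = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(X, y))
    s2 = sse / (n - p)  # residual variance
    t_values = []
    for j in range(p):
        unit = [1.0 if i == j else 0.0 for i in range(p)]
        inv_jj = solve(xtx, unit)[j]  # j-th diagonal element of the inverse of X'X
        t_values.append(beta[j] / math.sqrt(s2 * inv_jj))
    return beta, t_values


def backward_eliminate(columns, y, t_crit=2.0):
    """Drop the least significant regressor until all remaining ones pass t_crit."""
    kept = list(columns)
    while kept:
        X = [[1.0] + [columns[name][i] for name in kept] for i in range(len(y))]
        _, t_values = fit_with_t(X, y)
        # t_values[0] belongs to the intercept, which is not a candidate for removal
        weakest = min(range(len(kept)), key=lambda j: abs(t_values[j + 1]))
        if abs(t_values[weakest + 1]) >= t_crit:
            break
        kept.pop(weakest)
    return kept


# Synthetic data: y depends on x1 and x2 only; x3 is unrelated to y
rng = random.Random(0)
n = 40
columns = {name: [rng.uniform(0, 10) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [2.0 + 3.0 * a - 1.5 * b + rng.gauss(0, 0.5)
     for a, b in zip(columns["x1"], columns["x2"])]
kept = backward_eliminate(columns, y)
print(kept)
```

The influential regressors x1 and x2 survive the elimination. Whether the unrelated regressor x3 is dropped depends on the particular sample, since an irrelevant variable can appear significant by chance, which is one reason for the cross-validation check recommended above.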

When a black-box model is used containing several re- 
gressors, step-wise regression would improve the robustness 
of the model by reducing the number of regressors in the mo- 
del, and thus hopefully reduce the adverse effects of multi- 
collinearity between the remaining regressors. Many packa- 
ges use the F-test indicative of the overall model instead of 
the t-test on individual parameters to perform the step-wise 
regression. A value of F=4 is often chosen. It is suggested 
that step-wise regression not be used in case the regress- 
ors are highly correlated since it may result in non-robust 
models. However, the backward procedure is said to better 
handle such situations than the forward selection procedure. 

A note of caution is warranted in using stepwise regress- 
ion for engineering models based on mechanistic conside- 
rations. In certain cases, stepwise regression may omit a re- 
gressor which ought to be influential when using a particular 
data set, while the regressor is picked up when another data 
set is used. This may be a dilemma when the model is to 
be used for subsequent predictions. In such cases, discretion 
based on physical considerations should trump purely statis- 
tical model building. 
Example 5.7.3: Proper model identification with multivariate regression models

An example of multivariate regression is the development of model equations to characterize the performance of refrigeration compressors. It is possible to regress compressor manufacturer's tabular data of compressor performance using the following simple bi-quadratic formulation (see Fig. 5.11 for nomenclature).



$$y = C_0 + C_1 T_{cho} + C_2 T_{cdi} + C_3 T_{cho}^2 + C_4 T_{cdi}^2 + C_5 T_{cho} T_{cdi} \tag{5.60}$$

where $y$ represents either the compressor power ($P_{comp}$) or the cooling capacity ($Q_{ch}$).

OLS is then used to develop estimates of the six model parameters, $C_0$-$C_5$, based on the compressor manufacturer's data. The biquadratic model was used to estimate the parameters for compressor cooling capacity (in Tons) for a screw compressor. The model and its corresponding parameter estimates are given in Table 5.11. Although the overall curve fit for the data was excellent ($R^2 = 99.96\%$), the t-values of two parameter estimates ($C_2$ and $C_4$) are clearly insignificant.

A second stage regression is done omitting these regressors, resulting in the following model, with the coefficient values and t-values shown in Table 5.11.

$$y = C_0 + C_1 T_{cho} + C_3 T_{cho}^2 + C_5 T_{cho} T_{cdi}$$

All of the parameters in the simplified model are significant and the overall model fit remains excellent: $R^2 = 99.5\%$. ■

Table 5.11 Results of the first and second stage model building

              With all parameters        With significant parameters only
Coefficient   Value        t-value       Value        t-value
C0            152.50       6.27          114.80       73.91
C1            3.71         36.14         3.91         11.17
C2            -0.335       -0.62         -            -
C3            0.0279       52.35         0.027        14.82
C4            -0.000940    -0.32         -            -
C5            -0.00683     -6.13         -0.00892     -2.34

Footnote to Example 5.7.3: From ASHRAE (2005) © American Society of Heating, Refrigerating and Air-conditioning Engineers, Inc., www.ashrae.org.

5.8 Case Study Example: Effect of Refrigerant Additive on Chiller Performance

The objective of this analysis was to verify the claim of a company which had developed a refrigerant additive to improve chiller COP. The performance of a chiller before (called pre-retrofit period) and after (called post-retrofit period) addition of this additive was monitored for several months to determine whether the additive results in an improvement in chiller performance, and if so, by how much. The same four variables described in Example 5.4.3, namely two temperatures ($T_{cho}$ and $T_{cdi}$), the chiller thermal cooling load ($Q_{ch}$) and the electrical power consumed ($P_{comp}$) were measured at intervals of 15 min. Note that the chiller COP can be deduced from the last two variables. Altogether, there were 4,607 and 5,078 data points for the pre- and post-periods respectively.

Footnote to Sect. 5.8: The monitored data was provided by Ken Gillespie for which we are grateful.

Step 1: Perform Exploratory Data Analysis. At the onset, an exploratory data analysis should be performed to determine the spread of the variables, and their occurrence frequencies during the pre- and post-periods, i.e., before and after addition of the refrigerant additive. Further, it is important to ascertain whether the operating conditions during both periods are similar or not. The eight frames in Fig. 5.27 summarize the spread and frequency of the important variables. It is noted that though the spreads in the operating conditions are similar, the frequencies are different during both periods, especially in the condenser temperatures and the chiller load variables. Figure 5.28 suggests that $COP_{post} > COP_{pre}$. An ANOVA test with results shown in Table 5.12 and Fig. 5.29 also indicates that the mean post-retrofit COP is statistically different at the 95% confidence level as compared to the mean pre-retrofit COP.

t Test to Compare Means

Null hypothesis: mean($COP_{post}$) = mean($COP_{pre}$)

Alternative hypothesis: mean($COP_{post}$) $\neq$ mean($COP_{pre}$)
assuming equal variances:

t = 38.8828, p-value = 0.0

The null hypothesis is rejected at $\alpha = 0.05$.

Of particular interest is the confidence interval for the difference between the means, which extends from 0.678 to 0.750. Since the interval does not contain the value 0.0, there is a statistically significant difference between the means of the two samples at the 95.0% confidence level. However, it would be incorrect to infer that $COP_{post} > COP_{pre}$ since the operating conditions are different, and thus one should not use this approach to draw any conclusions. Hence, a regression model based approach is warranted.
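The equal-variance t-test used above can be sketched as follows. The function name `pooled_t_test` and the two tiny samples are hypothetical. The last lines check the reported t-value against the reported confidence interval half-width of the difference (0.714 ± 0.03599), assuming the large-sample 95% critical value of 1.96.

```python
import math


def pooled_t_test(sample_1, sample_2):
    """Return (difference in means, its standard error, t statistic), equal variances assumed."""
    n1, n2 = len(sample_1), len(sample_2)
    mean_1 = sum(sample_1) / n1
    mean_2 = sum(sample_2) / n2
    ss_1 = sum((v - mean_1) ** 2 for v in sample_1)
    ss_2 = sum((v - mean_2) ** 2 for v in sample_2)
    pooled_var = (ss_1 + ss_2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    diff = mean_1 - mean_2
    return diff, std_err, diff / std_err


# Hypothetical COP readings (not the monitored data)
diff, std_err, t_stat = pooled_t_test([8.5, 8.6, 8.7], [7.8, 7.9, 8.0])
print(diff, std_err, t_stat)

# Consistency check on the reported results: half-width = 1.96 * standard error
t_from_interval = 0.714 / (0.03599 / 1.96)
print(round(t_from_interval, 2))  # 38.88
```

The t-value recovered from the interval agrees with the reported t = 38.8828 to within rounding of the tabulated figures.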

Step 2: Use the Entire Pre-retrofit Data to Identify a Model. The GN chiller models (Gordon and Ng 2000) are described in Pr. 5.13. The monitored data is first used to compute the variables of the model given by the regression model Eqs. 5.70 and 5.71. Then, a linear regression is performed; the result is given below with the standard errors of the coefficients shown within parentheses:
Fig. 5.27 Histograms depicting the range of variation and frequency of the four important variables before and after the retrofit (pre = 4,607 data points, post = 5,078 data points). The condenser water temperature and the chiller load show much larger variability during the post period

Fig. 5.28 Histogram plots of Coefficient of Performance (COP) of chiller before and after retrofit. Clearly, there are several instances when $COP_{post} > COP_{pre}$, but that could be due to operating conditions. Hence, a regression modeling approach is clearly warranted



Fig. 5.29 ANOVA test results in the form of box-and-whisker plots for 
chiller COP before and after addition of refrigerant additive 
Table 5.12 Results of the ANOVA Test of comparison of means at significance level of 0.05

95.0% confidence interval for mean of COP_post:                                             8.573 ± 0.03142 = [8.542, 8.605]
95.0% confidence interval for mean of COP_pre:                                              7.859 ± 0.01512 = [7.844, 7.874]
95.0% confidence interval for the difference between the means assuming equal variances:   0.714 ± 0.03599 = [0.678, 0.750]



[Figure: Pre-Retrofit Chiller Power Model vs Pre Measured; x-axis: Measured Pre Data (kW)]

Fig. 5.30 X-Y plot of chiller power during pre-retrofit period. The overall fit is excellent (RMSE = 9.36 kW and CV = 2.24%), and except for a few data points, the data seems well behaved. Total number of data points = 4,607



$$y = -0.00187\,x_1 + 261.2885\,x_2 + 0.022461\,x_3 \tag{5.61}$$

with standard errors (0.00163), (15.925) and (0.000111) respectively, and adjusted $R^2 = 0.998$



This model is then re-transformed into a model for power using Eq. 5.76, and the error statistics using the pre-retrofit data are found to be: RMSE = 9.36 kW and CV = 2.24%. Figure 5.30 shows the x-y plot from which one can visually evaluate the goodness of fit of the model. Note that the mean power use = 418.7 kW while the mean of the model residuals = 0.017 kW (negligibly close to zero, as it should be; this step validates the fact that the spreadsheet cells have been coded correctly with the right formulas).

Step 3: Calculate Savings in Electrical Power. The above chiller model, representative of the thermal performance of the chiller without refrigerant additive, is used to estimate savings by first predicting power use for each 15 min interval using the two operating temperatures and the load corresponding to the 5,078 post-retrofit data points. Subsequently, savings in chiller power are deduced for each of the 5,078 data points:

$$\text{Power savings} = \text{Model-predicted pre-retrofit use} - \text{measured post-retrofit use} \tag{5.62}$$
Fig. 5.31 Difference in X-Y plots of chiller power indicating that post-retrofit values are higher than those during pre-retrofit period (mean increase = 21 kW or 7.88%). One can clearly distinguish two operating patterns in the data suggesting some intrinsic behavioral change in chiller operation. Entire data set for the post-period consisting of 5,078 observations has been used in this analysis



It is found that the mean power savings = -21.0 kW (i.e., an increase in power use), which amounts to an increase of 7.88% relative to the mean model-predicted pre-retrofit power use of about 266.5 kW, the measured mean post-retrofit power use being 287.5 kW. Figure 5.31 visually illustrates the extent to which power use during the post-retrofit period has increased as compared to the pre-retrofit model. Overlooking the few outliers, one notes that there are two patterns: a larger number of data points indicating that post-retrofit electric power use was much higher, and a smaller set where the difference is little to nil. The reason for the onset of two distinct patterns in operation is worthy of a subsequent investigation.
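The savings calculation of Eq. 5.62 can be sketched as below. The function name `power_savings` and the three interval values are hypothetical. The second part uses only the reported means (savings of -21.0 kW and measured post-retrofit use of 287.5 kW) and shows that the quoted 7.88% is obtained when the change is expressed relative to the mean model-predicted baseline.

```python
def power_savings(predicted_pre_kw, measured_post_kw):
    """Eq. 5.62 applied interval by interval: predicted baseline minus measured use."""
    if len(predicted_pre_kw) != len(measured_post_kw):
        raise ValueError("both series must cover the same intervals")
    return [p - m for p, m in zip(predicted_pre_kw, measured_post_kw)]


# Hypothetical 15-min intervals (kW); negative savings mean higher post-retrofit use
savings = power_savings([260.0, 270.0, 265.0], [285.0, 288.0, 290.0])
mean_savings = sum(savings) / len(savings)
print(savings, mean_savings)

# Reported means for the case study
mean_measured_post = 287.5      # kW
mean_savings_reported = -21.0   # kW
mean_predicted_pre = mean_measured_post + mean_savings_reported  # 266.5 kW
percent_change = 100.0 * mean_savings_reported / mean_predicted_pre
print(round(percent_change, 2))  # -7.88
```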

Step 4: Calculate Uncertainty in Savings and Draw Conclusions. The uncertainty arises from two sources: prediction model and power measurement errors. The latter are usually small, about 0.1% of the reading, which in this particular case is less than 1 kW. Hence, this contribution can be neglected during an initial investigation such as this one. The model uncertainty is given by:

$$\text{absolute uncertainty in power use savings or reduction} = (t\text{-value} \times \mathrm{RMSE}) \tag{5.63}$$

The t-value at 90% confidence level = 1.65 and RMSE of the model (for the pre-retrofit period) = 9.36 kW.

Hence the calculated power savings due to the refrigerant additive = -21.0 kW ± 15.44 kW at 90% CL, i.e., a statistically significant increase in power use. Thus, one would conclude that the refrigerant additive is actually penalizing chiller performance by 7.88% since electric power use is increased.
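The arithmetic of Eq. 5.63 is simple enough to verify directly. The variable names below are illustrative; the inputs are the values quoted in the text.

```python
t_value = 1.65         # t-value at the 90% confidence level, as quoted
rmse_kw = 9.36         # RMSE of the pre-retrofit model, kW
mean_savings_kw = -21.0

half_width_kw = t_value * rmse_kw  # Eq. 5.63
lower_kw = mean_savings_kw - half_width_kw
upper_kw = mean_savings_kw + half_width_kw
savings_differ_from_zero = not (lower_kw <= 0.0 <= upper_kw)

print(round(half_width_kw, 2), round(lower_kw, 2), round(upper_kw, 2))  # 15.44 -36.44 -5.56
print(savings_differ_from_zero)  # True
```

Since the whole interval lies below zero, the negative savings (i.e., the increase in power use) cannot be attributed to model uncertainty alone at this confidence level.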

Note: The entire analysis was redone by cleaning the 
post-retrofit data so as to remove the dual sets of data (see 
Fig. 5.31). Even then, the same conclusion was reached. 



Problems 
Table 5.13 Data table for Problem 5.1

Temperature t (°C)             0        10       20       30       40       50       60       70       80       90       100
Specific volume v (m³/kg)      206.3    106.4    57.84    32.93    19.55    12.05    7.679    5.046    3.409    2.361    1.673
Sat. vapor enthalpy (kJ/kg)    2501.6   2519.9   2538.2   2556.4   2574.4   2592.2   2609.7   2626.9   2643.8   2660.1   2676


Pr. 5.1 Table 5.13 lists various properties of saturated water 
in the temperature range 0-100°C. 

(a) Investigate first-order and second-order polynomials that fit saturated vapor enthalpy to temperature in °C. Identify the better model by looking at $R^2$, RMSE and CV values for both models. Predict the value of saturated vapor enthalpy at 30°C along with 95% confidence intervals and 95% prediction intervals.

(b) Repeat the above analysis for specific volume but in- 
vestigate third-order polynomial fits as well. Predict the 
value of specific volume at 30°C along with 95% confi- 
dence intervals and 95% prediction intervals. 

Pr. 5.2 Tensile tests on a steel specimen yielded the results 
shown in Table 5.14. 

(a) Assuming the regression of y on x to be linear, estimate the parameters of the regression line and determine the 95% confidence limits for x = 4.5.

(b) Now regress x on y, and estimate the parameters of the regression line. For the same value of y predicted in (a) above, determine the value of x. Compare this value with the value of 4.5 assumed in (a). If different, discuss why.

(c) Compare the $R^2$ and CV values of both models.

(d) Plot the residuals of both models.

(e) Of the two models, which is preferable for OLS estimation?

Pr. 5.3 The yield of a chemical process was measured at 
three temperatures (in °C), each with two concentrations of a 
particular reactant, as recorded in Table 5.15. 
(a) Use OLS to find the best values of the coefficients a, b, 
and c in the equation: y=a+bt+cx. 

Table 5.14 Data table for Problem 5.2

Tensile force x    1     2     3     4     5     6
Elongation y       15    35    41    63    77    84

Table 5.15 Data table for Problem 5.3

Temperature, t      40     40     50     50     60     60
Concentration, x    0.2    0.4    0.2    0.4    0.2    0.4
Yield y             38     42     41     46     46     49



(b) Calculate the $R^2$, RMSE, and CV of the overall model as well as the SE of the parameters.

(c) Using the $\beta$ coefficient concept described in Sect. 5.4.5, determine the relative importance of the two independent variables on the yield.

Pr. 5.4 Cost of electric power generation versus load factor 
and cost of coal 

The cost to an electric utility of producing power ($C_{ele}$) in mills per kilowatt-hr ($\$10^{-3}$/kWh) is a function of the load factor (LF) in % and the cost of coal ($C_{coal}$) in cents per million Btu. Relevant data is assembled in Table 5.16.

(a) Investigate different models (first order and second order with and without interaction terms) and identify the best model for predicting $C_{ele}$ vs LF and $C_{coal}$. Use stepwise regression if appropriate. (Hint: plot the data and look for trends first).

(b) Perform residual analysis.

(c) Calculate the $R^2$, RMSE, and CV of the overall model as well as the SE of the parameters.

Pr. 5.5 Modeling of cooling tower performance 

Manufacturers of cooling towers often present catalog data showing outlet-water temperature as a function of ambient air wet-bulb temperature ($T_{wb}$) and range (which is the difference between inlet and outlet water temperatures). Table 5.17 assembles data for a specific cooling tower. Identify an appropriate model (investigate first order and second order polynomial models for the outlet-water temperature) by looking at $R^2$, RMSE and CV values, the individual t-values of the parameters as well as the behavior of the overall model residuals.

Table 5.16 Data table for Problem 5.4

LF        85     80     70     74     67     87     78     73     72     69     82     89
C_coal    15     17     27     23     20     29     25     14     26     29     24     23
C_ele     4.1    4.5    5.6    5.1    5.0    5.2    5.3    4.3    5.8    5.7    4.9    4.8

Table 5.17 Data table for Problem 5.5 (outlet-water temperature in °C)

               T_wb (°C)
Range (°C)     20       21.5     23       23.5     26
10             25.89    26.65    27.49    27.78    29.38
13             26.40    27.11    27.90    28.18    29.75
16             26.99    27.64    28.38    28.66    30.18
19             27.65    28.24    28.94    29.20    30.69
22             28.38    28.92    29.58    29.83    31.28

Pr. 5.6 Steady-state performance testing of solar thermal flat plate collector

Solar thermal collectors are devices which convert the radiant energy from the sun into useful thermal energy that goes to heating, say, water for domestic or for industrial applications. Because of low collector time constants, heat capacity effects are usually small compared to the hourly time step used to drive the model. The steady-state useful energy $q_c$ delivered by a solar flat-plate collector of surface area $A_c$ is given by the Hottel-Whillier-Bliss equation (Reddy 1987):



$$q_c = A_c F_R \left[ I_T \eta_n - U_L (T_{ci} - T_a) \right]^{+} \tag{5.64}$$

where $F_R$ is called the heat removal factor and is a measure of the solar collector performance as a heat exchanger (since it can be interpreted as the ratio of actual heat transfer to the maximum possible heat transfer); $\eta_n$ is the optical efficiency or the product of the transmittance and absorptance of the cover and absorber of the collector at normal solar incidence; $U_L$ is the overall heat loss coefficient of the collector which is dependent on collector design only; $I_T$ is the radiation intensity on the plane of the collector; $T_{ci}$ is the temperature of the fluid entering the collector; and $T_a$ is the ambient temperature. The + sign denotes that only positive values are to be used, which physically implies that the collector should not be operated if $q_c$ is negative, i.e., when the collector loses more heat than it can collect (which can happen under low radiation and high $T_{ci}$ conditions).

Steady-state collector testing is the best manner for a manufacturer to rate his product. From an overall heat balance on the collector fluid and from Eq. 5.64, the expressions for the instantaneous collector efficiency $\eta_c$ under normal solar incidence are:

$$\eta_c \equiv \frac{q_c}{A_c I_T} = \frac{(m c_p)_c (T_{co} - T_{ci})}{A_c I_T} = F_R \eta_n - F_R U_L \left( \frac{T_{ci} - T_a}{I_T} \right) \tag{5.65}$$

where $m$ is the total fluid flow rate through the collectors, $c_p$ is the specific heat of the fluid flowing through the collector, and $T_{ci}$ and $T_{co}$ are the inlet and exit temperatures of the fluid to the collector. Thus, measurements (of course done as per the standard protocol, ASHRAE 1978) of $I_T$, $T_{ci}$ and $T_{co}$ are done under a pre-specified and controlled value of fluid flow rate. The test data are plotted as $\eta_c$ against reduced temperature $[(T_{ci} - T_a)/I_T]$ as shown in Fig. 5.32. A linear fit is made to these data points by regression, from which the values of $F_R \eta_n$ and $F_R U_L$ are easily deduced.

If the same collector is tested on different days, slightly different numerical values are obtained for the two parameters $F_R \eta_n$ and $F_R U_L$ which are often, but not always, within the uncertainty bands of the estimates. Model misspecification (i.e., the model is not perfect; for example, it is known that the collector heat losses are not strictly linear) is partly the cause of such variability. This is somewhat disconcerting to a manufacturer since this introduces ambiguity as to which values of the parameters to present in his product specification sheet.

Fig. 5.32 Test data points of thermal efficiency of a double glazed flat-plate liquid collector with reduced temperature. The regression line of the model given by Eq. 5.65 is also shown. Test conditions: collector tilt angle = 45°, inlet temperature = 32 to 60°C, flow rate = 0.0136 kg/(s m²), solar flux = 599 to 1009 W/m². (From ASHRAE (1978) © American Society of Heating, Refrigerating and Air-conditioning Engineers, Inc., www.ashrae.org)

The data points of Fig. 5.32 are assembled in Table 5.18. 
Assume that water is the working fluid. 

(a) Perform OLS regression using Eq. 5.65 and identify the two parameters $F_R \eta_n$ and $F_R U_L$ along with their standard errors. Plot the model residuals, and study their behavior.

(b) Draw a straight line visually through the data points and determine the x-axis and y-axis intercepts. Estimate the $F_R \eta_n$ and $F_R U_L$ parameters and compare them with those determined from (a).

(c) Calculate the $R^2$, RMSE and CV values of the model.

(d) Calculate the F-statistic to test for overall significance of the model.

(e) Perform t-tests on the individual model parameters.

(f) Use the model to predict collector efficiency when $I_T = 800$ W/m², $T_{ci} = 35$°C and $T_a = 10$°C.


Table 5.18 Data table for Problem 5.6

x       y (%)    x       y (%)    x       y (%)    x       y (%)
0.009   64       0.051   30       0.064   27       0.077   20
0.011   65       0.052   30       0.065   26       0.080   16
0.025   56       0.053   31       0.065   24       0.083   14
0.025   56       0.056   29       0.069   24       0.086   14
0.025   52.5     0.056   29       0.071   23       0.091   12
0.025   49       0.061   29       0.071   21       0.094   10
0.050   35       0.062   25       0.075   20
(g) Determine the 95% CL intervals for the mean and individual
responses for (f) above.

(h) The steady-state model of the solar thermal collector
assumes the heat loss term given by [UA(T_ci - T_a)] to be
linear with the temperature difference between the collector
inlet temperature and the ambient temperature. One wishes to
investigate whether the model improves if the loss term is to
include an additional second order term:
   (i) Derive the resulting expression for collector efficiency
   analogous to Eq. 5.65. (Hint: start with the fundamental heat
   balance equation, Eq. 5.64.)
   (ii) Do the data justify the use of such a model?
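
Parts (a) and (c) can be started with a short script such as the following. It is a minimal sketch, assuming NumPy is available and that Eq. 5.65 is the straight line y = b0 + b1·x in the variables of Table 5.18, so that the intercept estimates F_R η_n (in %) and the negative of the slope estimates F_R U_L (in % per unit of reduced temperature). The variable names are illustrative and are not taken from the text.

```python
import numpy as np

# Table 5.18: reduced temperature x and collector efficiency y (%)
x = np.array([0.009, 0.011, 0.025, 0.025, 0.025, 0.025, 0.050,
              0.051, 0.052, 0.053, 0.056, 0.056, 0.061, 0.062,
              0.064, 0.065, 0.065, 0.069, 0.071, 0.071, 0.075,
              0.077, 0.080, 0.083, 0.086, 0.091, 0.094])
y = np.array([64, 65, 56, 56, 52.5, 49, 35,
              30, 30, 31, 29, 29, 29, 25,
              27, 26, 24, 24, 23, 21, 20,
              20, 16, 14, 14, 12, 10])

n = x.size
X = np.column_stack([np.ones(n), x])          # design matrix [1, x]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS estimates (intercept, slope)
resid = y - X @ beta

dof = n - 2                                   # two estimated parameters
s2 = resid @ resid / dof                      # residual variance
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))  # standard errors

r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
rmse = np.sqrt(s2)
cv = 100.0 * rmse / y.mean()                  # coefficient of variation, %

# Part (f): reduced temperature for I_T = 800 W/m2, T_ci = 35 C, T_a = 10 C
x_new = (35.0 - 10.0) / 800.0
y_new = beta[0] + beta[1] * x_new

print(f"intercept = {beta[0]:.2f} (se {se[0]:.2f}), slope = {beta[1]:.1f} (se {se[1]:.1f})")
print(f"R2 = {r2:.3f}, RMSE = {rmse:.2f}, CV = {cv:.1f}%, predicted efficiency = {y_new:.1f}%")
```

The residual vector `resid` can then be plotted against x to look for the curvature that part (h) asks about.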

Pr. 5.7' Dimensionless model for fans or pumps 
The performance of a fan or pump is characterized in terms
of the head or the pressure rise across the device and the flow
rate for a given shaft power. The use of dimensionless variables
simplifies and generalizes the model. Dimensional analysis
(consistent with fan affinity laws for changes in speed,
diameter and air density) suggests that the performance of a
centrifugal fan can be expressed as a function of two
dimensionless groups representing the pressure head and the flow
coefficient respectively:

$$ \Psi = \frac{SP}{D^2 \omega^2 \rho} \quad \text{and} \quad \Phi = \frac{Q}{D^3 \omega} \tag{5.66} $$

where SP is the static pressure, Pa; D the diameter of wheel,
m; ω the rotative speed, rad/s; ρ the density, kg/m³; and Q the
volume flow rate of air, m³/s.

For a fan operating at constant density, it should be possible
to plot one curve Ψ vs Φ that represents the performance
at all speeds. The performance of a certain 0.3 m diameter
fan is shown in Table 5.19.



Table 5.19 Data table for Problem 5.7

Rotation    Flow rate   Static pressure    Rotation    Flow rate   Static pressure
ω (rad/s)   Q (m³/s)    SP (Pa)            ω (rad/s)   Q (m³/s)    SP (Pa)
157         1.42        861                94          0.94        304
157         1.89        861                94          1.27        299
157         2.36        796                94          1.89        219
157         2.83        694                94          2.22        134
157         3.02        635                94          2.36        100
157         3.30        525                63          0.80        134
126         1.42        548                63          1.04        122
126         1.79        530                63          1.42        70
126         2.17        473                63          1.51        55
126         2.36        428
126         2.60        351
126         3.30        114


(a) First, plot the data and formulate two or three promising
functions.

(b) Identify the best function by looking at the R², RMSE
and CV values and also at the residuals.

Assume the density of air at STP conditions to be 1.204 kg/m³.
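
As a starting point for this problem, the two dimensionless groups of Eq. 5.66 can be computed from Table 5.19 and one candidate function fitted. The sketch below assumes NumPy and uses a second-order polynomial Ψ = c0 + c1·Φ + c2·Φ² purely as one illustrative candidate; the problem leaves the choice of functional form open.

```python
import numpy as np

D = 0.3      # wheel diameter, m (from the problem statement)
rho = 1.204  # air density, kg/m^3 (from the problem statement)

# Table 5.19: columns are omega (rad/s), Q (m^3/s), SP (Pa)
data = np.array([
    [157, 1.42, 861], [157, 1.89, 861], [157, 2.36, 796],
    [157, 2.83, 694], [157, 3.02, 635], [157, 3.30, 525],
    [126, 1.42, 548], [126, 1.79, 530], [126, 2.17, 473],
    [126, 2.36, 428], [126, 2.60, 351], [126, 3.30, 114],
    [94, 0.94, 304], [94, 1.27, 299], [94, 1.89, 219],
    [94, 2.22, 134], [94, 2.36, 100],
    [63, 0.80, 134], [63, 1.04, 122], [63, 1.42, 70], [63, 1.51, 55],
])
omega, Q, SP = data.T

psi = SP / (rho * D**2 * omega**2)  # dimensionless pressure head, Eq. 5.66
phi = Q / (D**3 * omega)            # dimensionless flow coefficient, Eq. 5.66

# Candidate model (an assumption): quadratic in phi
coef = np.polyfit(phi, psi, 2)      # highest power first
psi_hat = np.polyval(coef, phi)
resid = psi - psi_hat

n, p = psi.size, 3
rmse = np.sqrt(resid @ resid / (n - p))
r2 = 1.0 - resid @ resid / np.sum((psi - psi.mean()) ** 2)
cv = 100.0 * rmse / psi.mean()
print(f"R2 = {r2:.3f}, RMSE = {rmse:.4f}, CV = {cv:.1f}%")
```

Other candidates can be compared by replacing the `np.polyfit` call and recomputing the same three statistics, together with a plot of `resid` against `phi`.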

Pr. 5.8 Consider the data used in Example 5.6.3 meant to
illustrate the use of weighted regression for replicate
measurements with non-constant variance. For the same data set,
identify a model using the logarithmic transform approach
similar to that shown in Example 5.6.2.

Pr. 5.9 Spline models for solar radiation
This problem involves using splines for functions with abrupt
hinge points. Several studies have proposed correlations
to predict different components of solar radiation from more
routinely measured components. One such correlation relates
the ratio of the hourly diffuse solar radiation on a horizontal
surface (I_d) to the global radiation on a horizontal surface
(I) to a quantity known as the hourly atmospheric clearness
index (k_T = I/I_0), where I_0 is the extraterrestrial hourly
radiation on a horizontal surface at the same latitude and time
and day of the year (Reddy 1987). The latter is an astronomical
quantity and can be predicted almost exactly. Data have been
gathered (Table 5.20) from which a correlation (I_d/I) = f(k_T)
needs to be identified.

(a) Plot the data and visually determine likely locations of
hinge points. (Hint: there should be two points, one at
either extreme.)

(b) Previous studies have suggested the following three
functional forms: a constant model for the lower range, a
second order model for the middle range, and a constant model
for the higher range. Evaluate with the data provided whether
this functional form still holds, and report pertinent models
and relevant goodness-of-fit indices.



From Stoecker (1989) by permission of McGraw-Hill. 



Table 5.20 Data table for Problem 5.9

k_T     (I_d/I)    k_T     (I_d/I)
0.1     0.991      0.5     0.658
0.15    0.987      0.55    0.55
0.2     0.982      0.6     0.439
0.25    0.978      0.65    0.333
0.3     0.947      0.7     0.244
0.35    0.903      0.75    0.183
0.4     0.839      0.8     0.164
0.45    0.756      0.85    0.166
                   0.9     0.165
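
The three-segment form suggested in part (b) of Pr. 5.9 can be tried out as sketched below. This is a hedged illustration, assuming NumPy; the hinge locations `k_lo = 0.25` and `k_hi = 0.8` are hypothetical values picked by eye and are meant to be varied, and the helper `fit_three_segment` is not from the text. The segments are fitted independently, so continuity at the hinge points is not enforced as it would be in a true spline.

```python
import numpy as np

# Table 5.20: clearness index k_T and diffuse fraction I_d/I
k_t = np.array([0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9])
ratio = np.array([0.991, 0.987, 0.982, 0.978, 0.947, 0.903, 0.839,
                  0.756, 0.658, 0.55, 0.439, 0.333, 0.244, 0.183,
                  0.164, 0.166, 0.165])


def fit_three_segment(k, r, k_lo, k_hi):
    """Constant below k_lo, quadratic in between, constant above k_hi."""
    low = k <= k_lo
    high = k >= k_hi
    mid = ~low & ~high
    c_lo = r[low].mean()                   # OLS estimate of a constant is the mean
    c_hi = r[high].mean()
    quad = np.polyfit(k[mid], r[mid], 2)   # second-order model for the middle range
    pred = np.where(low, c_lo, np.where(high, c_hi, np.polyval(quad, k)))
    n_par = 5                              # 1 + 3 + 1 coefficients (hinges treated as fixed)
    rmse = np.sqrt(np.sum((r - pred) ** 2) / (k.size - n_par))
    return c_lo, quad, c_hi, rmse


c_lo, quad, c_hi, rmse = fit_three_segment(k_t, ratio, k_lo=0.25, k_hi=0.8)
print(f"low constant = {c_lo:.4f}, high constant = {c_hi:.4f}, RMSE = {rmse:.4f}")
```

Repeating the call for a few neighbouring hinge locations and comparing the RMSE values gives a simple check on the visual choice made in part (a).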
Table 5.21 Data table for Problem 5.10

Balance point temp. (°C)   25      20      15      10
VBDD (°C-Days)             4,750   3,900   2,000   1,100   500   100



Pr. 5.10 Modeling variable base degree-days with balance
point temperature at a specific location

Degree-day methods provide a simple means of determining
annual energy use in envelope-dominated buildings operated
constantly and with simple HVAC systems which can be
characterized by a constant efficiency. Such simple
single-measure methods capture the severity of the climate in a
particular location. The variable base degree day (VBDD) is
conceptually similar to the simple degree-day method but is
an improvement since it is based on the actual balance point
of the house instead of the outdated default value of 65°F or
18.3°C (ASHRAE 2009). Table 5.21 assembles the VBDD
values for New York City, NY from actual climatic data over
several years at this location.

Identify a suitable regression curve for VBDD versus balance
point temperature for this location and report all pertinent
statistics.

Pr. 5.11 Change point models of utility bills in variable
occupancy buildings

Example 5.7.1 illustrated the use of linear spline models
to model monthly energy use in a commercial building versus
outdoor dry-bulb temperature. Such models are useful for
several purposes, one of which is energy conservation. For
example, the energy manager may wish to track the extent to
which energy use has been increasing over the years, or the
effect of a recently implemented energy conservation measure
(such as a new chiller). For such purposes, one would like
to correct, or normalize, for any changes in weather since an
abnormally hot summer could obscure the beneficial effects
of a more efficient chiller. Hence, factors which change over
the months or the years need to be considered explicitly in
the model. Two common normalization factors include changes
to the conditioned floor area (for example, an extension to
an existing wing), or changes in the number of students in a
school. A model regressing monthly utility energy use against
outdoor temperature is appropriate for buildings with constant
occupancy (such as residences) or even offices. However,
buildings such as schools are practically closed during summer,
and hence, the occupancy rate needs to be included as the
second regressor. The functional form of the model, in such
cases, is a multi-variate change point model given by:

$$ y = \beta_{0,un} + \beta_0 f_{oc} + \beta_{1,un} x + \beta_1 f_{oc} x + \beta_{2,un} (x - x_c) I + \beta_2 f_{oc} (x - x_c) I \tag{5.67} $$

where x and y are the monthly mean outdoor temperature
(T) and the electricity use per square foot of the school (E)
respectively, and f_oc = N_oc/N_total represents the ratio of the
number of days in the month when the school is in session (N_oc)
to the total number of days in that particular month (N_total).
The factor f_oc can be determined from the school calendar.
Clearly, the unoccupied fraction f_un = 1 - f_oc.

The term I represents an indicator variable whose numerical
value is given by Eq. 5.54b. Note that the change point
temperatures for occupied and unoccupied periods are assumed
to be identical since the monthly data does not allow
this separation to be identified.

Consider the monthly data assembled (shown in Table 5.22).

(a) Plot the data and look for change points in the data.
Note that the model given by Eq. 5.67 has 7 parameters
of which x_c (the change point temperature) is the one
which makes the estimation non-linear. By inspection
of the scatter plot, you will assume a reasonable value
for this variable, and proceed to perform a linear
regression as illustrated in Example 5.7.1. The search for
the best value of x_c (one with minimum RMSE) would
require several OLS regressions assuming different values
of the change point temperature.



Table 5.22 Data table for Problem 5.11

Year  Month  E (W/ft²)  T (°F)   f_oc     Year  Month  E (W/ft²)  T (°F)   f_oc
94    Aug    1.006      78.233   0.41     95    Aug    1.351      81.766   0.39
94    Sep    1.123      73.686   0.68     95    Sep    1.337      76.341   0.71
94    Oct    0.987      66.784   0.67     95    Oct    0.987      65.805   0.68
94    Nov    0.962      61.037   0.65     95    Nov    0.938      56.714   0.66
94    Dec    0.751      52.475   0.42     95    Dec    0.751      52.839   0.41
95    Jan    0.921      49.373   0.65     96    Jan    0.921      49.270   0.65
95    Feb    0.947      53.764   0.68     96    Feb    0.947      55.873   0.66
95    Mar    0.876      59.197   0.58     96    Mar    0.873      55.200   0.57
95    Apr    0.918      65.711   0.66     96    Apr    0.993      66.221   0.65
95    May    1.123      73.891   0.65     96    May    1.427      78.719   0.64
95    Jun    0.539      77.840            96    Jun    0.567      78.382   0.1
95    Jul    0.869      81.742            96    Jul    1.005      82.992   0.2
(b) Identify the parsimonious model, and estimate the
appropriate parameters of the model. Note that of the six
parameters appearing in Eq. 5.67, some of the parameters
may be statistically insignificant, and appropriate
care should be exercised in this regard. Report
appropriate model and parameter statistics.

(c) Perform a residual analysis and discuss results.
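
The repeated-OLS search over the change point temperature described in part (a) of Pr. 5.11 can be organized as in the sketch below. It assumes NumPy and assumes that the indicator variable of Eq. 5.54b (not reproduced in this section) equals 1 when x exceeds x_c and 0 otherwise. The functions `design_matrix` and `search_change_point` are hypothetical helpers, and the data used to exercise them are synthetic, not the values of Table 5.22.

```python
import numpy as np


def design_matrix(x, f_oc, x_c):
    """Regressors of Eq. 5.67 for a fixed change point x_c."""
    ind = (x > x_c).astype(float)   # assumed form of the indicator variable I
    hinge = (x - x_c) * ind
    return np.column_stack([np.ones_like(x), f_oc, x, f_oc * x,
                            hinge, f_oc * hinge])


def search_change_point(x, y, f_oc, candidates):
    """Run one OLS fit per candidate x_c and keep the one with minimum RMSE."""
    best = None
    for x_c in candidates:
        X = design_matrix(x, f_oc, x_c)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        rmse = np.sqrt(resid @ resid / (len(y) - X.shape[1]))
        if best is None or rmse < best[0]:
            best = (rmse, x_c, beta)
    return best


# Synthetic check: noise-free data generated with a known change point of 65
rng = np.random.default_rng(0)
x = rng.uniform(45.0, 85.0, 24)
f_oc = rng.uniform(0.1, 0.7, 24)
beta_true = np.array([0.3, 0.6, 0.002, 0.004, 0.01, 0.03])
y = design_matrix(x, f_oc, 65.0) @ beta_true

rmse, x_c_hat, beta_hat = search_change_point(x, y, f_oc,
                                              np.arange(55.0, 76.0, 1.0))
print(f"best change point = {x_c_hat}, RMSE = {rmse:.2e}")
```

With measured data the RMSE curve is usually flat near its minimum, so plotting RMSE against the candidate values is more informative than the single best value.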

Pr. 5.12 Determining energy savings from monitoring and
verification (M&V) projects

A crucial element in any energy conservation program is the
ability to verify savings from measured energy use data;
this is referred to as monitoring and verification (M&V).
Energy service companies (ESCOs) are required, in most
cases, to perform this as part of their services. Figure 5.33
depicts how energy savings are estimated. A common M&V
protocol involves measuring the monthly total energy use
at the facility for a whole year before the retrofit (this is the
baseline period or the pre-retrofit period) and a whole year
after the retrofit (called the post-retrofit period). The time
taken for implementing the energy saving measures (called
the "construction period") is neglected in this simple example.
One first identifies a baseline regression model of energy
use against ambient dry-bulb temperature T during the
pre-retrofit period, E_pre = f(T). This model is then used to
predict energy use during each month of the post-retrofit period
by using the corresponding ambient temperature values. The
difference between model predicted and measured monthly
energy use is the energy savings during that month.

Energy savings = Model-predicted pre-retrofit use
                 - measured post-retrofit use          (5.68)

The annual savings resulting from the energy retrofit and
their uncertainty are finally determined. It is very important
that the uncertainty associated with the savings estimates be
determined as well for meaningful conclusions to be reached
regarding the impact of the retrofit on energy use.

You are given monthly data of outdoor dry bulb temperature
(T) and area-normalized whole building electricity use
(WB) for two years (Table 5.23). The first year is the
pre-retrofit period before a new energy management and control
system (EMCS) for the building is installed, and the second
is the post-retrofit period. The construction period, i.e., the
period it takes to implement the conservation measures, is taken
to be negligible.

(a) Plot time series and x-y plots and see whether you can
visually distinguish the change in energy use as a result
of installing the EMCS (similar to Fig. 5.33);

(b) Evaluate at least two different models (with one of them
being a model with indicator variables) for the pre-retrofit
period, and select the better model;



Table 5.23 Data table for Problem 5.12

Pre-retrofit period                    Post-retrofit period
Month      T (°F)   WB (W/ft²)         Month      T (°F)   WB (W/ft²)
1994-Jul   84.04    3.289              1995-Jul   83.63    2.362
Aug        81.26    2.827              Aug        83.69    2.732
Sep        77.98    2.675              Sep        80.99    2.695
Oct        71.94    1.908              Oct        72.04    1.524
Nov        66.80    1.514              Nov        62.75    1.109
Dec        58.68    1.073              Dec        57.81    0.937
1995-Jan   56.57    1.237              1996-Jan   54.32    1.015
Feb        60.35    1.253              Feb        59.53    1.119
Mar        62.70    1.318              Mar        58.70    1.016
Apr        69.29    1.584              Apr        68.28    1.364
May        77.14    2.474              May        78.12    2.208
Jun        80.54    2.356              Jun        80.91    2.070



(c) Use this baseline model to determine month-by-month
energy use during the post-retrofit period representative
of energy use had the conservation measure not been
implemented (this is the "model-predicted pre-retrofit
use" of Eq. 5.68);

(d) Determine the month-by-month as well as the annual
energy savings;

(e) The ESCO which suggested and implemented the ECM
claims a savings of 15%. You have been retained by the
building owner as an independent M&V consultant to
verify this claim. Prepare a short report describing your
analysis methodology, results and conclusions. (Note:
you should also calculate the 90% uncertainty in the
savings estimated assuming zero measurement uncertainty.
Only the cumulative annual savings and their
uncertainty are required, not month-by-month values.)
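
The mechanics of Eq. 5.68 can be sketched with the data of Table 5.23 as follows. This is only an illustration, assuming NumPy and a simple straight-line baseline E_pre = b0 + b1·T, which is a placeholder for whichever model is selected in part (b); the uncertainty calculation requested in part (e) is not covered here.

```python
import numpy as np

# Table 5.23, pre-retrofit period (Jul 1994 to Jun 1995)
t_pre = np.array([84.04, 81.26, 77.98, 71.94, 66.80, 58.68,
                  56.57, 60.35, 62.70, 69.29, 77.14, 80.54])
e_pre = np.array([3.289, 2.827, 2.675, 1.908, 1.514, 1.073,
                  1.237, 1.253, 1.318, 1.584, 2.474, 2.356])

# Table 5.23, post-retrofit period (Jul 1995 to Jun 1996)
t_post = np.array([83.63, 83.69, 80.99, 72.04, 62.75, 57.81,
                   54.32, 59.53, 58.70, 68.28, 78.12, 80.91])
e_post = np.array([2.362, 2.732, 2.695, 1.524, 1.109, 0.937,
                   1.015, 1.119, 1.016, 1.364, 2.208, 2.070])

# Baseline model identified from pre-retrofit data only
# (straight line used here as a stand-in for the model chosen in part (b))
b1, b0 = np.polyfit(t_pre, e_pre, 1)

# Baseline use projected onto post-retrofit weather
e_baseline = b0 + b1 * t_post

# Eq. 5.68: savings = model-predicted pre-retrofit use - measured post-retrofit use
monthly_savings = e_baseline - e_post
annual_savings = monthly_savings.sum()
fraction_saved = annual_savings / e_baseline.sum()

print(f"annual savings = {annual_savings:.2f} (sum of monthly W/ft2 values)")
print(f"fraction of baseline use saved = {100 * fraction_saved:.1f}%")
```

Because the baseline is evaluated at the post-retrofit temperatures, the comparison is normalized for weather differences between the two years; a change point or indicator variable model would replace only the two lines that fit and evaluate the baseline.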



Fig. 5.33 Schematic representation of energy use prior to and
after installing energy conservation measures (ECM) and of the
resulting energy savings (energy use plotted against time, with
the instant at which the ECM is installed and the post-retrofit
period marked)

Pr. 5.13¹⁰ Gray-box and black-box models of centrifugal
chiller using field data

You are asked to evaluate two types of models: physical
or gray-box models versus polynomial or black-box models.
A brief overview of these is provided below.

¹⁰ Data for this problem is given in Appendix B.

(a) Gray-Box Models The Universal Thermodynamic Model
proposed by Gordon and Ng (2000) is to be used. The
GN model is a simple, analytical, universal model for chiller
performance based on first principles of thermodynamics
and linearized heat losses. The model predicts the dependent
chiller COP (defined as the ratio of chiller (or evaporator)
thermal cooling capacity Q_ch by the electrical power P_comp
consumed by the chiller (or compressor)) with specially chosen
independent (and easily measurable) parameters such as
the fluid (water or air) inlet temperature from the condenser
T_cdi, fluid temperature leaving the evaporator (or the chilled
water return temperature from the building) T_cho, and the
thermal cooling capacity of the evaporator (similar to the
figure for Example 5.4.3). The GN model is a three-parameter
model which, for parameter identification, takes the following
form:

$$ \left(\frac{1}{COP} + 1\right)\frac{T_{cho}}{T_{cdi}} - 1 = a_1 \frac{T_{cho}}{Q_{ch}} + a_2 \frac{(T_{cdi} - T_{cho})}{T_{cdi}\, Q_{ch}} + a_3 \frac{(1/COP + 1)\, Q_{ch}}{T_{cdi}} \tag{5.69} $$

where the temperatures are in absolute units, and the
parameters of the model have physical meaning in terms of
irreversibilities:

a_1 = Δs, the total internal entropy production rate in the
chiller due to internal irreversibilities,

a_2 = Q_leak, the rate of heat losses (or gains) from (or in to)
the chiller,

$$ a_3 = R = \frac{1}{(mCE)_{cond}} + \frac{1 - E_{evap}}{(mCE)_{evap}} $$

i.e., the total heat exchanger thermal resistance which
represents the irreversibility due to finite-rate heat exchange,
where m is the mass flow rate, C the specific heat of water,
and E the heat exchanger effectiveness.

The model applies both to unitary and large chillers operating
under steady state conditions. Evaluations by several
researchers have shown this model to be very accurate for
a large number of chiller types and sizes. If one introduces:

$$ x_1 = \frac{T_{cho}}{Q_{ch}}, \quad x_2 = \frac{T_{cdi} - T_{cho}}{T_{cdi}\, Q_{ch}}, \quad x_3 = \frac{(1/COP + 1)\, Q_{ch}}{T_{cdi}} \quad \text{and} \quad y = \left(\frac{1}{COP} + 1\right)\frac{T_{cho}}{T_{cdi}} - 1 \tag{5.70} $$

Eq. 5.69 assumes the following linear form:

$$ y = a_1 x_1 + a_2 x_2 + a_3 x_3 \tag{5.71} $$

Although most commercial chillers are designed and installed
to operate at constant coolant flow rates, variable
condenser water flow operation (as well as evaporator
flow rate) is being increasingly used to improve overall
cooling plant efficiency especially at low loads. In order
to accurately correlate chiller model performance under
variable condenser flow, an analytic model as follows was
developed:

$$ \frac{T_{cho}(1 + 1/COP)}{T_{cdi}} - 1 - \frac{1}{(V \rho c)_{cond}} \frac{(1/COP + 1)\, Q_{ch}}{T_{cdi}} = c_1 \frac{T_{cho}}{Q_{ch}} + c_2 \frac{T_{cdi} - T_{cho}}{Q_{ch}\, T_{cdi}} + c_3 \frac{Q_{ch}(1 + 1/COP)}{T_{cdi}} \tag{5.72} $$

If one introduces

$$ x_1 = \frac{T_{cho}}{Q_{ch}}, \quad x_2 = \frac{T_{cdi} - T_{cho}}{Q_{ch}\, T_{cdi}}, \quad x_3 = \frac{(1/COP + 1)\, Q_{ch}}{T_{cdi}} \quad \text{and} \quad y = \frac{T_{cho}(1/COP + 1)}{T_{cdi}} - 1 - \frac{1}{(V \rho c)_{cond}} \frac{(1/COP + 1)\, Q_{ch}}{T_{cdi}} \tag{5.73} $$

where V, ρ and c are the volumetric flow rate, the density and
specific heat of the condenser water.

For the variable condenser flow rate, Eq. 5.72 becomes

$$ y = c_1 x_1 + c_2 x_2 + c_3 x_3 \tag{5.74} $$
(b) Black-Box Models Whereas the structure of a gray box
model, like the GN model, is determined from the underlying
physics, the black box model is characterized as having
no (or sparse) information about the physical problem
incorporated in the model structure. The model is regarded
as a black box and describes an empirical relationship between
input and output variables. The commercially available
DOE-2 building energy simulation model (DOE-2 1993) relies
on the same parameters as those for the physical model,
but uses a second order linear polynomial model instead.
This "standard" empirical model (also called a multivariate
polynomial linear model or MLR) has 10 coefficients which
need to be identified from monitored data:

$$ COP = b_0 + b_1 T_{cdi} + b_2 T_{cho} + b_3 Q_{ch} + b_4 T_{cdi}^2 + b_5 T_{cho}^2 + b_6 Q_{ch}^2 + b_7 T_{cdi} T_{cho} + b_8 T_{cdi} Q_{ch} + b_9 T_{cho} Q_{ch} \tag{5.75} $$

These coefficients, unlike the three coefficients appearing
in the GN model, have no physical meaning and their magnitude
cannot be interpreted in physical terms. Collinearity in
regressors and ill-behaved residual behavior are also
problematic issues. Usually one needs to retain in the model only
those parameters which are statistically significant, and this
is best done by step-wise regression.

Table B.3 in Appendix B assembles data consisting of 52
sets of observations from a 387 ton centrifugal chiller with
variable condenser flow data. A sample hold-out cross-validation
scheme will be used to guard against over-fitting. Though
this is a severe type of split, use the first 36 data points as
training data and the rest (shown in italics) as testing data.

(a) You will use the three models described above
(Eqs. 5.71, 5.74 and 5.75) to identify suitable regression
models. Study residual behavior as well as collinearity
issues between regressors. Identify the best forms
of the GN and the MLR model formulations.

(b) Evaluate which of these models is superior in terms of
their external prediction accuracy. The GN and MLR
models have different y-values and so you cannot use
the statistics provided by the regression package directly.
You need to perform subsequent calculations in
a spreadsheet using the power as the basis of comparing
model accuracy and reporting internal and external
prediction accuracies. For the MLR model, this is easily
deduced from the model predicted COP values. For
the GN model with constant flow, rearranging terms of
Eq. 5.71 yields the following expression for the chiller
electric power P_comp:

$$ P_{comp} = \frac{Q_{ch}(T_{cdi} - T_{cho}) + a_1 T_{cdi} T_{cho} + a_2 (T_{cdi} - T_{cho}) + a_3 Q_{ch}^2}{T_{cho} - a_3 Q_{ch}} $$

(c) Report all pertinent steps performed in your analysis
and present your results succinctly.

Helpful tips:

(i) Convert temperatures into degrees Celsius, Q_ch into kW
and volumetric flow rate V into L/s for unit consistency
(work in SI units).

(ii) For the GN model, all temperatures should be in absolute
units.

(iii) Degrees of freedom (d.f.) have to be estimated correctly
in order to compute RMSE and CV. For internal prediction,
d.f. = n - p where n is the number of data points and
p the number of model parameters. For external prediction
accuracy, d.f. = m where m is the number of data
points in the testing data set.
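
The constant-flow GN workflow of parts (a) and (b) (build the regressors of Eq. 5.70, fit Eq. 5.71 without an intercept, then convert back to power) can be sketched as follows. The example assumes NumPy and uses synthetic operating points generated from made-up parameter values, since Table B.3 is not reproduced in this section; `gn_regressors` and `gn_power` are hypothetical helper names.

```python
import numpy as np


def gn_regressors(t_cdi, t_cho, q_ch, cop):
    """Regressors x1, x2, x3 and response y of Eq. 5.70 (temperatures in K)."""
    u = 1.0 / cop + 1.0
    x1 = t_cho / q_ch
    x2 = (t_cdi - t_cho) / (t_cdi * q_ch)
    x3 = u * q_ch / t_cdi
    y = u * t_cho / t_cdi - 1.0
    return np.column_stack([x1, x2, x3]), y


def gn_power(a, t_cdi, t_cho, q_ch):
    """Chiller electric power obtained by rearranging Eq. 5.71."""
    a1, a2, a3 = a
    num = (q_ch * (t_cdi - t_cho) + a1 * t_cdi * t_cho
           + a2 * (t_cdi - t_cho) + a3 * q_ch**2)
    return num / (t_cho - a3 * q_ch)


# Synthetic, noise-free operating points (illustrative values only)
rng = np.random.default_rng(1)
t_cdi = rng.uniform(295.0, 305.0, 40)   # condenser inlet temperature, K
t_cho = rng.uniform(278.0, 285.0, 40)   # chilled water leaving temperature, K
q_ch = rng.uniform(500.0, 1300.0, 40)   # cooling load, kW
a_true = np.array([0.05, 100.0, 0.003]) # made-up a1, a2, a3

p_true = gn_power(a_true, t_cdi, t_cho, q_ch)
cop = q_ch / p_true

# Fit Eq. 5.71 by least squares; note that the model has no intercept term
X, y = gn_regressors(t_cdi, t_cho, q_ch, cop)
a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Compare models on the basis of power, as the problem statement requires
p_hat = gn_power(a_hat, t_cdi, t_cho, q_ch)
cv_rmse = 100.0 * np.sqrt(np.mean((p_hat - p_true) ** 2)) / p_true.mean()
print("estimated a1, a2, a3:", a_hat)
print(f"CV of power prediction = {cv_rmse:.2e}%")
```

With the field data, the same two functions would be applied to the 36 training points for fitting and to the remaining points for the external accuracy check, using the degrees of freedom from tip (iii) instead of the plain mean used above.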

Pr. 5.14 Effect of tube cleaning in reducing chiller fouling

A widespread problem with liquid-cooled chillers is condenser
fouling which increases heat transfer resistance in the
condenser and results in reduced chiller COP. A common remedy
is to periodically (every year or so) brush-clean the insides
of the condenser tubes. Some practitioners question the
efficacy of this process though it is widely adopted in the
chiller service industry. In an effort to clarify this ambiguity,
an actual large chiller (with refrigerant R11) was monitored
during normal operation for 3 days before (9/11 to 9/13, 2000)
and 3 days after (1/17 to 1/19, 2001) tube cleaning was done.
Table B.4 (in Appendix B) assembles the entire data set of
72 observations for each period. This chiller is similar to the
figure for Example 5.4.3.

Analyze, using the GN model described in Pr. 5.13, the
two data sets, and determine the extent to which the COP of
the chiller has improved as a result of this action. Prepare a
report describing your analysis methodology, your analysis
results, the uncertainty in your results, your conclusions, and
any suggestions for future analysis work.
The data from which performance models are identified may
originate either from planned experiments or from non-intrusive
(or observational) data gathered while the system is
in normal operation. A large body of well accepted practices
is available for the former which falls under the general
terminology of "Design of Experiments" (DOE). This is the
process of defining the structural framework, i.e., prescribing
the exact manner in which samples for testing need to
be selected, and the conditions and sequence under which
the testing needs to be performed. This would provide the
"richness" in the data set necessary for statistically sound
performance models to be identified between the response
variable and the several categorical factors. Experimental
design methods, which allow extending hypothesis testing to
multiple variables as well as identifying sound performance
models, are presented. Selected experimental design methods
are discussed such as randomized block, Latin Squares
and 2^k factorial designs. The parallel between model building
in a DOE framework and linear multiple regression is
illustrated. Finally, this chapter addresses response surface
methods (RSM) which allow accelerating the search towards
optimizing a process or towards finding the conditions under
which a desirable behavior of a product is optimized. RSM
is a sequential approach where one starts with test conditions
in a plausible area of the search space, analyzes test results to
determine the optimal direction to move, performs a second
set of test conditions, and so on till the required optimum is
reached.



6.1 Background 

The two previous chapters dealt with statistical techniques
for analyzing data which was already gathered. However, no
amount of "creative" statistical data analysis can reveal
information not available in the data itself. Thus, the process
by which this data is gathered is itself an equally important
field of study. The process of proper planning and execution
of experiments, intentionally designed to provide data rich
in information especially suited for the intended objective,
is referred to as experimental design. Optimal experimental
design is one which stipulates the conditions under which
each observation should be taken so as to minimize/maximize
certain optimality criteria (say, the bias and variance of
the parameter estimators). Practical considerations and
constraints often complicate the design of optimal experiments
and these factors should also be explicitly factored in.

One needs to differentiate between two conditions under
which data can be collected. On one hand, one can have a
controlled setting where the various variables of interest can
be altered by the experimenter. In such a case, referred to as
intrusive testing, one can plan an "optimal" experiment where
one can adjust the inputs and boundary or initial conditions
as well as choose the number and location of the sensors so
as to minimize the effect of errors on estimated values of the
parameters. On the other hand, one may be in a situation where
one is a mere "spectator", i.e., the system or phenomenon
cannot be controlled, and the data is collected under
non-experimental conditions (as is the case of astronomical
observations). Such an experimental protocol, known as
non-intrusive identification, is usually not the best approach. In
certain cases, the driving forces may be so weak or repetitive that
even when a "long" data set is used for identification, a strong
enough or varied output signal cannot be elicited for proper
statistical treatment (see Chap. 10). An intrusive or controlled
experimental protocol, wherein the system is artificially stressed
to elicit a strong response, is more likely to yield robust
and accurate models and their parameter estimates. However,
in some cases, the type and operation of the system may not
allow such intrusive experiments to be performed.

One should appreciate differences between measurements
made in a laboratory setting and in the field. The potential
for errors, both bias and random, is usually much greater in
the latter. Not only can measurements made on a piece of
laboratory equipment be better designed and closely controlled,
but they will be more accurate as well because more
expensive sensors can be selected and placed correctly in the
system. For example, proper flow measurement requires that
the flowmeter be placed 30 pipe diameters after a bend. A
laboratory set-up can be designed accordingly, while field
conditions may not allow such conditions to be met
satisfactorily. Further, systems being operated in the field may
not allow controlled tests to be performed, and one has to
develop a model or make decisions based on what one can
observe.

Experimental design techniques were developed about
100 years ago primarily in the context of agricultural research,
subsequently migrating to industrial engineering, and
then on to other fields. The historic reason for their development
was to allow ascertaining, by hypothesis testing, whether
a certain treatment, which could be an additive nutrient
such as a fertilizer or a design change such as an alloy
modification for a machine part, increased the yield or improved
the product. The statistical techniques which stipulate how
each of the independent variables or factors have to be varied
so as to obtain the most information about system behavior
quantified by a response variable, and do so with a minimum
of tests (and hence, least effort and expense), are called
experimental designs or design of experiments (DOE). To
rephrase, DOE involves the complete reasoning process of
defining the structural framework, i.e., prescribing the exact
manner in which samples for testing need to be selected, and
the conditions and sequence under which the testing needs to
be performed under specific restrictions imposed by space,
time and nature of the process (Mandel 1964).

The applications of DOE have expanded to the area of
model building as well. It is now used to identify which subsets
among several variables influence the response variable,
and to determine a quantitative relationship between them.
Generally, DOE and model building involve three issues:

(a) to "screen" a large number of possible candidates or likely
variables, and select the dominant variables, which
are referred to as factors in experimental design terminology.
These possible candidate factors are then subject
to more extensive investigation;

(b) to formulate how the tests need to be carried out so that
sources of unsuspected and uncontrollable/extraneous
errors can be minimized, while eliciting the necessary
"richness" in system behavior. The richness of the data
set cannot be ascertained by the amount of data but by
the extent to which all possible states of the system are
represented in the data set. This is especially true for
observational data sets where data is collected while the
system is under routine day-to-day operation without
any external intervention by the observer. However, under
controlled test operation, such as in a laboratory,
DOE allows optimal model identification to be achieved
with the least effort and expense;

(c) to build a suitable model between the factors and the
response variable using the data set acquired previously.
This involves both hypothesis testing so as to identify
the significant factors as well as the model building and
residual diagnostic checking phases.

The relative importance of the three issues depends on the
specific circumstance. For example, often, and especially so
in engineering model building, the dominant regressor set is
known beforehand, acquired either from mechanistic insights
or prior experimentation, and so the screening phase may be
redundant. In the context of calibrating a detailed simulation
program with monitored data, the problem involves dozens,
if not hundreds, of input parameters. Which parameters are
best tuned and which left alone can be determined from a
sensitivity analysis which is directly based on the principles
of screening tests in DOE.



6.2 Complete and Incomplete Block Designs 

6.2.1 Randomized Complete Block Designs 

Recall that the various hypothesis tests presented in Chap. 4
dealt mainly with problems involving one variable or factor
only. The design of experiments can be extended to include:

(i) several factors, but only one, two and three factors will
be discussed for the sake of simplicity. The factors could
either be controllable by the experimenter or be
uncontrollable (the latter are also called extraneous or
nuisance factors). The source of variation of controllable
variables can be reduced by fixing them at preselected levels,
while that of the uncontrollable variables requires suitable
experimental procedures; this process is called blocking, and
is described below;

(ii) several levels (or treatments, a term widely used in DOE
terminology), but only two to four levels will be considered
for conceptual simplicity. The levels of a factor are
the different values or categories which the factor can
assume. This can be dictated by the type of variable
(which can be continuous or discrete or categorical) or
selected by the experimenter. Different combinations of
the levels of the factors are called experimental units.

Note that though the factors can be either continuous or
categorical, the initial design of experiments treats them in
the same fashion. In case of continuous variables, their range
of variation is discretized into a relatively small set of
numerical values; as a result, the levels have a magnitude
associated with them. This is not the case with categorical
variables, such as, say, male or female, where there is no
magnitude involved; the grouping is done based on some
classification criterion such as a level or treatment.

It is important to keep in mind that the intent of the ex- 
perimental design process, as stated earlier, is to provide the 
necessary "richness" in the data set in order to: (i) perform 
hypothesis testing of several factors at different levels, and 
(ii) to identify statistically sound performance models between
the response variable and the several factors. This is
done using ANOVA techniques. If a statistical model descri- 
bing the impact of a single variable or factor on the depen- 
dent or response variable y is to be identified, the one-way 
ANOVA (described in Sect. 4.3.1) allows one to ascertain 
whether there are statistically significant differences between 
the mean at different levels of x if x is a continuous variable, 
or at different categories of x if x is a categorical variable. If 
two or more variables are involved, the procedure is called 
multifactor ANOVA.

The simplest type of design is the randomized unrestric- 
ted block design which involves selecting at random the com- 
binations of the factors or levels under which to perform the 
experiments. This type of design, if done naively, is not very 
efficient and may require an unnecessarily large number of 
experiments to be performed. The concept is illustrated with 
a simple example, from the agricultural area, from which 
DOE emerged. Say, one wishes to evaluate the yield of four
newly developed varieties of wheat (labeled $x_1$, $x_2$, $x_3$, $x_4$).
Four plots of land or field stations (or experimental units) 
are available, which unfortunately, are located in 4 different 
geographic regions which differ in climate. The geographic 
location is likely to affect yield. The simplest way of assig- 
ning which station will be planted by which variety of wheat 
is to do so randomly — one such result (among many possible 
ones) is shown in Table 6.1. Note that the intent of this inves- 
tigation is to determine the effect of wheat variety on yield. 
The location of the field or station is a "nuisance" variable,
but in this case it can be controlled; this is achieved by suitable
blocking. However, there may be other variables which
are uncontrollable (say, excessive rainfall in one of the regions
during the test), and these can be partially compensated for by
assuming them to be random and replicating or repeating the
tests more than once for each combination. The above example
serves to illustrate the principle of restricted randomization
by blocking the variability in an uncontrollable effect.

Such an unrestricted randomization leads to needless 
replication (for example, wheat type x^ is tested twice in 
Region 1 and not at all in Region 4) and may not be very 
efficient. Since the intention is to reduce variability in the un- 
controlled variable, in this case, the "region" variable, (note 
that the station to station difference in a region also exists 
but it will be less important and can be viewed as the error 



Table 6.1 Example of unrestricted randomized block design (one of 
several possibilities) for one factor of interest at levels $x_1$, $x_2$, $x_3$, $x_4$ and
one nuisance variable (Regions 1, 2, 3, 4). This is not an efficient design 
since x^ appears twice under Region 1 and not at all in Region 4 



Table 6.2 Example of restricted randomized block design (one of se- 
veral possibilities) for the same example of Table 6.1 
or random effect), one can insist that each variety of wheat
be tested in each region. There are again several possibilities, 
with one being shown in Table 6.2. 
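
The contrast between unrestricted and restricted randomization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variety labels, the choice of four plots per region, the seed, and the function names are hypothetical and are not taken from Tables 6.1 or 6.2.

```python
import random

varieties = ["x1", "x2", "x3", "x4"]   # four wheat varieties (treatments)
regions = [1, 2, 3, 4]                 # nuisance variable used for blocking


def unrestricted_design(varieties, regions, plots_per_region, rng):
    """Assign a variety to every plot completely at random (no blocking)."""
    return {r: [rng.choice(varieties) for _ in range(plots_per_region)]
            for r in regions}


def blocked_design(varieties, regions, rng):
    """Within each region (block), test every variety once, in random order."""
    design = {}
    for r in regions:
        order = list(varieties)
        rng.shuffle(order)
        design[r] = order
    return design


rng = random.Random(0)  # fixed seed (an arbitrary choice) for reproducibility
unrestricted = unrestricted_design(varieties, regions, 4, rng)
blocked = blocked_design(varieties, regions, rng)

for r in regions:
    print("Region", r, "unrestricted:", unrestricted[r], "blocked:", blocked[r])
```

In the unrestricted layout a variety may appear several times in one region and not at all in another, whereas the blocked layout guarantees that every variety is tested exactly once in every region.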

The approach can be extended to the treatment of two or
multiple factors; such designs are called factorial designs
(Devore and Farnum 2005). Consider two factors (labeled A and B)
which are to be studied at a and b levels respectively. This is
often referred to as an (a x b) design, and the standard manner of
representing the results of the tests is by assembling them
as shown in Table 6.3. Each combination of factor-level can
be tested more than once in order to minimize the effect of
random errors, and this is called replication. Though more
tests are done, replication reduces experimental errors introduced
by extraneous factors not explicitly controlled during
the experiments that can bias the results. Often for mathematical
convenience, each combination is tested at the same
replication level, and this is called a balanced design. Thus,
Table 6.3 is an example of a (3 x 2) balanced design with
replication r = 2.

The above terms are perhaps better understood in the context
of regression analysis (treated in Chap. 5). Let Z be the
response variable which is linear in regressor variable X, and
a model needs to be identified. Further, say, another variable
Y is known to influence Z which may smear the sought after
relation (such as the field station variable in the above example).
Selecting three specific values of X is akin to selecting three
levels for the factor X (say, $x_1$, $x_2$, $x_3$). The nuisance effect of
variable Y can be "blocked" by performing the tests at preselected
fixed levels or values of Y (say, $y_1$ and $y_2$). The corresponding
scatter plot is shown in Fig. 6.1. Repeat testing at
each of the six combinations in order to reduce experimental
errors is akin to replication; in this example, replication r = 3.
Finally, if the 18 tests are performed in random sequence, the
experimental design would qualify as a full factorial random
design.
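
The 18-run plan just described (three levels of X, two fixed levels of Y, three replicates, random test sequence) could be laid out as in the following sketch. The level labels and the seed are placeholders chosen for illustration only.

```python
import random
from itertools import product

x_levels = ["x1", "x2", "x3"]   # three levels of the factor of interest X
y_levels = ["y1", "y2"]         # two fixed levels of the blocked variable Y
replicates = 3                  # r = 3 tests per (X, Y) combination

# Every (X level, Y level, replicate number) combination: 3 x 2 x 3 = 18 runs.
runs = [(x, y, rep)
        for x, y in product(x_levels, y_levels)
        for rep in range(1, replicates + 1)]

rng = random.Random(42)   # arbitrary seed so the sequence can be reproduced
rng.shuffle(runs)         # randomize the order in which the tests are performed

for order, (x, y, rep) in enumerate(runs, start=1):
    print(order, x, y, rep)
```

Shuffling the complete list, rather than randomizing within each combination, is what makes the test sequence fully random while keeping the design balanced.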



Table 6.3 Standard method of assembling test results for a balanced
(3 x 2) design with two replicates per cell (r = 2)

                      Factor B
                      Level 1    Level 2    Average
Factor A   Level 1    10, 14     18, 14     14
           Level 2    23, 21     16, 20     20
           Level 3    31, 27     21, 25     26
Average               21         19         20
Fig. 6.1 Correspondence between block design approach and multiple
regression analysis



The averages shown in Table 6.3 correspond to those of
either the associated row or the associated column. Thus,
the average of the first row, i.e., {10, 14, 18, 14}, is shown
as 14, and so on. Plots of the average response versus the
levels of a factor yield a graph which depicts the trend, called
the main effect of the factor. Thus, Fig. 6.2a suggests that
the average response tends to increase linearly as factor
A changes from $A_1$ to $A_3$, while it decreases
a little as factor B changes from $B_1$ to $B_2$. The effect of
the factors on the response may not be purely additive; a
multiplicative component may be present as well. In such
cases, the two factors are said to interact with each other.
A first indication of whether such an interaction effect is
present can be obtained from the calculations
shown in Table 6.4.

Thus, the effect of going from $A_1$ to $A_3$ is 17 under $B_1$
and only 7 under $B_2$. This suggests interaction effects. A
simpler and more direct approach is to graph the two factor 
interaction plot as shown in Fig. 6.2b. Since the lines are 
not parallel (in this case they cross each other), one would 
infer strong interaction between the two factors. However, 
in many instances, such plots are not conclusive enough, 
and one needs to perform statistical tests to determine 
whether the main or the interaction effects are significant 
or not. Figure 6.3 shows the type of interaction plots one 
would obtain for the case when the interaction effects are 
not significant. 
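
The row, column and cell averages behind the main-effect and interaction plots can be reproduced with a short script. The sketch below uses only the data of Table 6.3; the variable names are illustrative.

```python
from statistics import mean

# Replicate observations from Table 6.3: data[i][j] holds the r = 2 values
# for level i+1 of factor A and level j+1 of factor B.
data = [
    [[10, 14], [18, 14]],
    [[23, 21], [16, 20]],
    [[31, 27], [21, 25]],
]

# Average of each cell (across replicates), each A level, and each B level.
cell_mean = [[mean(cell) for cell in row] for row in data]
a_mean = [mean([y for cell in row for y in cell]) for row in data]
b_mean = [mean([y for row in data for y in row[j]]) for j in range(2)]

# Effect of moving from A1 to A3 with B held at B1 and at B2 (as in Table 6.4).
effect_of_a = [cell_mean[2][j] - cell_mean[0][j] for j in range(2)]

print("cell means:", cell_mean)
print("averages by level of A:", a_mean)
print("averages by level of B:", b_mean)
print("A1 -> A3 effect under B1 and B2:", effect_of_a)
```

Unequal entries in `effect_of_a` correspond to non-parallel lines in the interaction plot; whether the difference is statistically significant still has to be settled by the ANOVA test.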

ANOVA decompositions allow breaking up the obser- 
ved total sum of square variation (SST) into its various 
contributing causes (the one factor ANOVA was described 
in Sect. 4.3.1). For a two-factor ANOVA decomposition 
(Devore and Farnum 2005): 



$$SST = SSA + SSB + SS(AB) + SSE \tag{6.1}$$

where the observed (total) sum of squares is:

$$SST = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{m=1}^{r} \left( y_{ijm} - \langle y \rangle \right)^2 = \mathrm{stdev}^2 \, (abr - 1)$$

sum of squares associated with factor A:



Fig. 6.2 Plots for the (3x2) 
balanced factorial design, a Main 
effects of factors A and B with 
mean and 95% intervals (data 
from Table 6.3). b Two-factor 
interaction plot (data from 
Table 6.4) 
Table 6.4 Interaction effect calculations

Effect of changing A (B fixed at B1)      Effect of changing A (B fixed at B2)
A1 and B1    (10 + 14)/2 = 12             A1 and B2    (18 + 14)/2 = 16
A2 and B1    (23 + 21)/2 = 22             A2 and B2    (16 + 20)/2 = 18
A3 and B1    (31 + 27)/2 = 29             A3 and B2    (21 + 25)/2 = 23
A1 to A3     29 - 12 = 17                 A1 to A3     23 - 16 = 7





Fig. 6.3 An example of a two-factor interaction plot when the factors 
have no interaction 



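$$SSA = b \, r \sum_{i=1}^{a} \left( \bar{A}_i - \langle y \rangle \right)^2$$

sum of squares associated with factor B:

$$SSB = a \, r \sum_{j=1}^{b} \left( \bar{B}_j - \langle y \rangle \right)^2$$

error or residual sum of squares:

$$SSE = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{m=1}^{r} \left( y_{ijm} - \bar{y}_{ij} \right)^2$$

sum of squares associated with the AB interaction is
SS(AB)

j = 1, ..., b is the index for levels of factor B
m = 1, ..., r is the index for replicates.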

Note that SS(AB) is deduced from Eq. 6.1 since all other 
quantities can be calculated. A linear statistical model, refer- 
red to as a random effects model, between the response and 
the two factors which includes the interaction term between 
factors A and B can be deduced. More specifically, this is 
called a non-additive two-factor model (it is non-additive 
because of the interaction term present) which assumes the 
following form since one starts with the grand average and 
adds individual effects of the factors, the interaction terms 
and the noise or error term: 

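$$y_{ij} = \langle y \rangle + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ij} \tag{6.2}$$

where

$\alpha_i$ represents the main effect of factor A at the i-th level
$= \bar{A}_i - \langle y \rangle$, with $\sum_{i=1}^{a} \alpha_i = 0$

$\beta_j$ the main effect of factor B at the j-th level
$= \bar{B}_j - \langle y \rangle$, with $\sum_{j=1}^{b} \beta_j = 0$

$(\alpha\beta)_{ij}$ the interaction between factors A and B
$= \bar{y}_{ij} - (\langle y \rangle + \alpha_i + \beta_j)$, with $\sum_{i=1}^{a} \sum_{j=1}^{b} (\alpha\beta)_{ij} = 0$

and $\varepsilon_{ij}$ is the error (or residual) assumed uncorrelated with
mean zero and variance $\sigma^2 = MSE$.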

The analysis of variance is done as described earlier, but 
care must be taken to use the correct degrees of freedom to 
calculate the mean squares (refer to Table 6.5). The analy- 
sis of variance model (Eq. 6.2) can be viewed as a special 
case of multiple linear regression (or more specifically to one 
with indicator variables — see Sect. 5.7.3). This is illustrated 
in the example below. 



with $y_{ijm}$ = observation under the m-th replication when A is at
level i and B is at level j
a = number of levels of factor A
b = number of levels of factor B
r = number of replications per cell
$\bar{A}_i$ = average of all response values at the i-th level of factor A
$\bar{B}_j$ = average of all response values at the j-th level of factor B
$\bar{y}_{ij}$ = average for each cell (i.e., across replications)
$\langle y \rangle$ = grand average
i = 1, ..., a is the index for levels of factor A



Example 6.2.1 : Two-factor ANOVA analysis and random 
effect model fitting 

Using the data from Table 6.3, determine whether the main
effect of factor A, main effect of factor B, and the interaction
effect of AB are statistically significant at $\alpha = 0.05$.
Subsequently, identify the random effects model.

First, using all 12 observations, one computes the grand
average <y> = 20 and the standard deviation stdev = 6.015.
Then, following Eq. 6.1:
Table 6.5 Computational procedure for a two-factor ANOVA design

Source of         Sum of     Degrees of    Mean square                     Computed       Degrees of freedom
variation         squares    freedom                                       F statistic    for p-value
Factor A          SSA        a-1           MSA = SSA/(a-1)                 MSA/MSE        a-1, ab(r-1)
Factor B          SSB        b-1           MSB = SSB/(b-1)                 MSB/MSE        b-1, ab(r-1)
AB interaction    SS(AB)     (a-1)(b-1)    MS(AB) = SS(AB)/[(a-1)(b-1)]    MS(AB)/MSE     (a-1)(b-1), ab(r-1)
Error             SSE        ab(r-1)       MSE = SSE/[ab(r-1)]             -              -
Total variation   SST        abr-1         -                               -              -



$$SST = \mathrm{stdev}^2 (abr - 1) = 6.015^2 \, [(3)(2)(2) - 1] = 398$$

$$SSA = (2)(2)[(14 - 20)^2 + (20 - 20)^2 + (26 - 20)^2] = 288$$

$$SSB = (3)(2)[(21 - 20)^2 + (19 - 20)^2] = 12$$

$$SSE = (10 - 12)^2 + (14 - 12)^2 + (18 - 16)^2 + (14 - 16)^2 + (23 - 22)^2 + (21 - 22)^2 + (16 - 18)^2 + (20 - 18)^2 + (31 - 29)^2 + (27 - 29)^2 + (21 - 23)^2 + (25 - 23)^2 = 42$$

Then, from

$$SST = SSA + SSB + SS(AB) + SSE$$

$$SS(AB) = SST - SSA - SSB - SSE = 398 - 288 - 12 - 42 = 56$$

Next, the expressions shown in Table 6.5 result in:

$$MSA = \frac{SSA}{a - 1} = \frac{288}{3 - 1} = 144$$

$$MSB = \frac{SSB}{b - 1} = \frac{12}{2 - 1} = 12$$

$$MS(AB) = \frac{SS(AB)}{(a - 1)(b - 1)} = \frac{56}{(2)(1)} = 28$$

$$MSE = \frac{SSE}{ab(r - 1)} = \frac{42}{(3)(2)(1)} = 7$$

The statistical significance of the factors can now be evaluated.

• Factor A: F-value $= \frac{MSA}{MSE} = \frac{144}{7} = 20.57$. Since the
critical F value for degrees of freedom (2, 6) at the 0.05
significance level is $F_c(2, 6) = 5.14$, and because the calculated
$F > F_c$, one concludes at the 95% confidence level that this factor
is significant.

• Factor B: F-value $= \frac{MSB}{MSE} = \frac{12}{7} = 1.71$. Since
$F_c(1, 6)$ at 0.05 = 5.99, this factor is not significant.

• Factor AB: F-value $= \frac{MS(AB)}{MSE} = \frac{28}{7} = 4$. Since
$F_c(2, 6)$ at 0.05 = 5.14, this factor is not significant.

The use of Eq. 6.2 can also be illustrated in terms of this
example. The main effects of A and B are given by the differences
between the level averages (the row and column averages
in Table 6.3) and the grand average <y> = 20:

$$\alpha_1 = (14 - 20) = -6; \quad \alpha_2 = (20 - 20) = 0; \quad \alpha_3 = (26 - 20) = 6;$$

$$\beta_1 = (21 - 20) = 1; \quad \beta_2 = (19 - 20) = -1;$$

and the interaction terms by (refer to Table 6.4 for the cell averages):

$$(\alpha\beta)_{11} = 12 - (20 - 6 + 1) = -3;$$
$$(\alpha\beta)_{21} = 22 - (20 + 0 + 1) = 1;$$
$$(\alpha\beta)_{31} = 29 - (20 + 6 + 1) = 2;$$
$$(\alpha\beta)_{12} = 16 - (20 - 6 - 1) = 3;$$
$$(\alpha\beta)_{22} = 18 - (20 + 0 - 1) = -1;$$
$$(\alpha\beta)_{32} = 23 - (20 + 6 - 1) = -2;$$

Following Eq. 6.2, the random effects model is:

$$y_{ij} = 20 + \{-6, 0, 6\}_i + \{1, -1\}_j + \{-3, 1, 2, 3, -1, -2\}_{ij} \quad \text{with } i = 1, 2, 3 \text{ and } j = 1, 2 \tag{6.3}$$



For example, the cell corresponding to ($A_1$, $B_1$) has a mean
value of 12 which is predicted by the above model as:
$y_{11} = 20 - 6 + 1 - 3 = 12$, and so on. Finally, the prediction
error of the model has a variance $\sigma^2 = MSE = 7$. Recasting
the above model as a regression model with indicator
variables may be insightful (though cumbersome) to those
more familiar with regression analysis methods:

$$y_{ij} = 20 + (-6) I_1 + (0) I_2 + (6) I_3 + (1) J_1 + (-1) J_2 + (-3) I_1 J_1 + (3) I_1 J_2 + (1) I_2 J_1 + (-1) I_2 J_2 + (2) I_3 J_1 + (-2) I_3 J_2$$

where $I_i$ and $J_j$ are indicator variables for the levels of
factors A and B respectively. ■
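
The calculations of Example 6.2.1 can be checked with a short script. The sketch below follows Eq. 6.1, Table 6.5 and Eq. 6.2 using only the Python standard library; the variable names are illustrative, and the critical F values are simply the tabulated ones quoted in the example rather than values computed from the F distribution.

```python
from statistics import mean

# data[i][j] = replicate observations for level i+1 of A and level j+1 of B.
data = [
    [[10, 14], [18, 14]],
    [[23, 21], [16, 20]],
    [[31, 27], [21, 25]],
]
a, b, r = 3, 2, 2

obs = [y for row in data for cell in row for y in cell]
grand = mean(obs)
a_mean = [mean([y for cell in row for y in cell]) for row in data]
b_mean = [mean([y for row in data for y in row[j]]) for j in range(b)]
cell_mean = [[mean(cell) for cell in row] for row in data]

# Sum-of-squares decomposition (Eq. 6.1).
sst = sum((y - grand) ** 2 for y in obs)
ssa = b * r * sum((m - grand) ** 2 for m in a_mean)
ssb = a * r * sum((m - grand) ** 2 for m in b_mean)
sse = sum((y - cell_mean[i][j]) ** 2
          for i in range(a) for j in range(b) for y in data[i][j])
ssab = sst - ssa - ssb - sse

# Mean squares and F statistics (Table 6.5).
mse = sse / (a * b * (r - 1))
f_stat = {
    "A": (ssa / (a - 1)) / mse,
    "B": (ssb / (b - 1)) / mse,
    "AB": (ssab / ((a - 1) * (b - 1))) / mse,
}
f_crit = {"A": 5.14, "B": 5.99, "AB": 5.14}  # tabulated values at the 0.05 level
significant = {name: f_stat[name] > f_crit[name] for name in f_stat}

# Effects appearing in the model of Eq. 6.2.
alpha = [m - grand for m in a_mean]
beta = [m - grand for m in b_mean]
interaction = [[cell_mean[i][j] - (grand + alpha[i] + beta[j])
                for j in range(b)] for i in range(a)]

print("SST, SSA, SSB, SS(AB), SSE:", sst, ssa, ssb, ssab, sse)
print("F statistics:", {k: round(v, 2) for k, v in f_stat.items()})
print("significant at 0.05:", significant)
print("alpha:", alpha, "beta:", beta, "interaction:", interaction)
```

Here `interaction[i][j]` holds the term $(\alpha\beta)_{(i+1)(j+1)}$, so the first row lists the two interaction terms of level $A_1$.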
Table 6.6 Machining time (in minutes) for Example 6.2.2

           Operator
Machine    1         2         3         4         5         6         Average
1          42.5      39.3      39.6      39.9      42.9      43.6      41.300
2          39.8      40.1      40.5      42.3      42.5      43.1      41.383
3          40.2      40.5      41.3      43.4      44.9      45.1      42.567
4          41.3      42.2      43.5      44.2      45.9      42.3      43.233
Average    40.950    40.525    41.225    42.450    44.050    43.525    42.121



Table 6.7 ANOVA table for Example 6.2.2

Source of    Sum of     Degrees of    Mean      Computed       Probability
variation    squares    freedom       square    F statistic
Machines     15.92      3             5.31      3.34           0.048
Operators    42.09      5             8.42
Error        23.84      15            1.59
Total        81.86      23            -







Example 6.2.2:¹ Evaluating performance of four machines
while blocking effect of operator dexterity
This example will illustrate the concept of randomized complete
block design with one factor. The performance of four
different machines $M_1$, $M_2$, $M_3$ and $M_4$ is to be evaluated
in terms of speed in making a widget. It is decided that the
same widget will be manufactured on these machines by
six different machinists or operators in a randomized block
experiment. The machines are assigned in a random order
to each operator. Since dexterity is involved, there will be
a difference among the operators in the time needed to machine
the widget. Table 6.6 assembles the time in minutes to
manufacture a widget.

Here, machine type is the treatment, while the uncontrol- 
lable factor is the operator. The effect of this factor is blocked 
or taken into consideration (or its effect minimized) by the 
randomized complete block design where all operators use 
all 4 machines. The analysis calls for testing the hypothesis 
at the 0.05 level of significance that the performance of the 
machines is identical. 

Let Factor A correspond to the machine type and B to the
operator. Thus a = 4 and b = 6, with replication r = 1. Then,
Eq. 6.1 reduces to:

$$SST = SSA + SSB + SSE$$

$$SSA = (6)[(41.300 - 42.121)^2 + (41.383 - 42.121)^2 + \cdots] = 15.92$$

$$SSB = (4)[(40.950 - 42.121)^2 + (40.525 - 42.121)^2 + \cdots] = 42.09$$

Total variation:

$$SST = (abr - 1) \, \mathrm{stdev}^2 = (23)(1.8865^2) = 81.86$$

Subsequently, SSE = 81.86 - 15.92 - 42.09 = 23.84

The ANOVA table can then be generated as given by Ta- 
ble 6.7. The value of F=3.34 is significant at p = 0.048. One 
would conclude that the performance of the machines cannot 
be taken to be similar at the 0.05 significance level (this is a 
close call though!). 
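
The entries of Table 6.7 can be reproduced from the raw data of Table 6.6 as sketched below. Only the standard library is used, so the p-value is not computed here (that would require the F distribution from a statistics package); the variable names are illustrative. Differences in the last digit relative to Table 6.7 can arise from rounding.

```python
from statistics import mean

# Machining times from Table 6.6: rows are machines 1-4, columns operators 1-6.
times = [
    [42.5, 39.3, 39.6, 39.9, 42.9, 43.6],
    [39.8, 40.1, 40.5, 42.3, 42.5, 43.1],
    [40.2, 40.5, 41.3, 43.4, 44.9, 45.1],
    [41.3, 42.2, 43.5, 44.2, 45.9, 42.3],
]
a, b = len(times), len(times[0])   # a = 4 treatments, b = 6 blocks, r = 1

grand = mean([y for row in times for y in row])
machine_mean = [mean(row) for row in times]
operator_mean = [mean([times[i][j] for i in range(a)]) for j in range(b)]

# Eq. 6.1 without the interaction term (single observation per cell).
sst = sum((y - grand) ** 2 for row in times for y in row)
ssa = b * sum((m - grand) ** 2 for m in machine_mean)    # machines
ssb = a * sum((m - grand) ** 2 for m in operator_mean)   # operators (blocks)
sse = sst - ssa - ssb

msa = ssa / (a - 1)
mse = sse / ((a - 1) * (b - 1))
f_machine = msa / mse

print(f"SSA = {ssa:.2f}, SSB = {ssb:.2f}, SSE = {sse:.2f}, SST = {sst:.2f}")
print(f"MSA = {msa:.2f}, MSE = {mse:.2f}, F(machines) = {f_machine:.2f}")
```

The resulting F statistic is then compared with the critical value for (3, 15) degrees of freedom, or converted to the probability listed in the last column of Table 6.7.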



Fig. 6.4 Factor mean plots of the two factors with six levels for Operator
variable and four for Machine variable



As illustrated earlier, graphical display of data can provi- 
de useful diagnostic insights in ANOVA type of problems as 
well. For example, a simple plotting of the raw observations 
around each treatment mean can provide a feel for variability 
between sample means and within samples. Figure 6.4 de- 
picts all the data as well as the mean variation. One notices 
that there are two unusually different values which stand out, 
and it may be wise to go back and study the experimental 
conditions which produced these results. Without these, the 
interaction effects seem small. 

A random effects model can also be identified. In this
case, an additive linear model is appropriate such as:

$$y_{ij} = \langle y \rangle + \alpha_i + \beta_j + \varepsilon_{ij} \tag{6.4}$$

¹ From Walpole et al. (2007) by © permission of Pearson Education.



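Inspection of the residuals can provide diagnostic insights
with regard to violation of normality and non-uniform variance
akin to regression analysis. Since model predictions
are given by:

$$\hat{y}_{ij} = \langle y \rangle + (\bar{A}_i - \langle y \rangle) + (\bar{B}_j - \langle y \rangle) = \bar{A}_i + \bar{B}_j - \langle y \rangle \tag{6.5a}$$

the residuals of the (i, j) observation are:

$$\varepsilon_{ij} = y_{ij} - \hat{y}_{ij} = y_{ij} - (\bar{A}_i + \bar{B}_j - \langle y \rangle), \quad i = 1, \ldots, 4 \text{ and } j = 1, \ldots, 6 \tag{6.5b}$$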

Two different residual plots have been generated. Figures 6.5
and 6.6 reveal that the scatter of the residuals versus operators
and versus model predicted values is fairly random except
for two large residuals (as noted earlier). Further, a normal
probability plot suggests that the model residuals are normally
distributed except for the two outliers (Fig. 6.7).

Fig. 6.5 Scatter plot of the residuals versus the six operators

Fig. 6.6 Scatter plot of residuals versus predicted values

Fig. 6.7 Normal probability plot of the residuals
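
The residuals of the additive model (Eq. 6.5b) are easily computed from the data of Table 6.6. The sketch below, with illustrative variable names, lists the two observations with the largest residuals so that the corresponding experimental conditions can be re-examined.

```python
from statistics import mean

# Machining times from Table 6.6: rows are machines 1-4, columns operators 1-6.
times = [
    [42.5, 39.3, 39.6, 39.9, 42.9, 43.6],
    [39.8, 40.1, 40.5, 42.3, 42.5, 43.1],
    [40.2, 40.5, 41.3, 43.4, 44.9, 45.1],
    [41.3, 42.2, 43.5, 44.2, 45.9, 42.3],
]
a, b = len(times), len(times[0])

grand = mean([y for row in times for y in row])
machine_mean = [mean(row) for row in times]
operator_mean = [mean([times[i][j] for i in range(a)]) for j in range(b)]

# Residual = observation minus additive prediction (Eqs. 6.5a and 6.5b).
residual = [[times[i][j] - (machine_mean[i] + operator_mean[j] - grand)
             for j in range(b)] for i in range(a)]

# Rank the observations by absolute residual, largest first.
ranked = sorted(((abs(residual[i][j]), i + 1, j + 1)
                 for i in range(a) for j in range(b)), reverse=True)

for size, machine, operator in ranked[:2]:
    print(f"machine {machine}, operator {operator}: |residual| = {size:.2f} min")
```

The sum of the squared residuals equals the SSE of the ANOVA table, which provides a convenient check on the calculation.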

An implicit and important assumption in the above model 
design is that the treatment and block effects are additive, i.e. 
negligible interaction effects. In the context of Example 6.2.2, 
it means that if, say, Operator 3 is 0.5 min faster on the ave- 
rage than Operator 2 on machine 1 , the same difference also 
holds for machines 2, 3, and 4. This pattern would be akin to 
that depicted in Fig. 6.3 where the mean responses of diffe- 
rent blocks differ by the same amount from one treatment to 
the next. In many experiments, this assumption of additivity 



does not hold, and the treatment and block effects interact (as 
illustrated in Fig. 6.2b). For example, Operator 1 may be fas- 
ter by 0.5 min on the average than Operator 2 when machine 
1 is used, but he may be slower by, say, 0.3 min on the average 
than Operator 2 when machine 2 is used. In such a case, the 
operators and the machines are said to be interacting. ■ 

The above treatment of full factorial designs was limited to
two factors. The treatment can be extended to a larger number
of factors; the extension is conceptually straightforward,
though the analysis becomes more tedious. The interested reader
can refer to Box et al. (1978) or Montgomery (2009) for such an analysis.



6.2.2 Incomplete Factorial Designs — Latin 
Squares 

The previous section covered two important concepts (ran- 
domization and blocking) which, performed together, allow 
sounder conclusions to be drawn from fewer tests. After the 
experimental design is formulated, the sequence of the tests, 
i.e., the selection of the combinations of different levels and 
different factors should be done in a random manner. This 
randomization would reduce (maybe, even eliminate) un- 
foreseen biases in experimental data which could be due to 
the effect of subtle factors not considered in the experiment. 
The other concept, namely blocking, is a form of stratified 
sampling whereby subjects or items in the sample of data are 
grouped into blocks according to some "matching" criterion 
so that the similarity of subjects within each block or group 
is maximized while those from block to block are minimi- 
zed. Pharmaceutical companies wishing to test the effecti- 
veness of a new drug adopt the above concepts extensively. 
Since different people react differently, grouping of subjects 
is done according to some criteria (such as age, gender, body 
fat percentage,...). Such blocking would result in more uni- 
formity among groups. Subsequently, a random administra- 
tion of the drugs to half of the people within each block with 
a placebo to the other half would constitute randomization. 
Thus, any differences between each block taken separately 
would be more pronounced than if randomization was done 
without blocking. 

When multiple factors are studied at multiple levels, the 
number of experiments required for full-factorial design can 
increase dramatically: 



$$\text{Number of experiments for full factorial} = \prod_{i=1}^{k} \mathrm{Levels}_i \tag{6.6}$$



where i is the index for the factors which total k. For the special
case when all factors have the same number of levels,
the number of experiments necessary for a complete factorial
design which includes all main effects and interactions
is $n^k$ where k is the number of factors and n the number of
levels. If certain assumptions are made, this number can be
reduced considerably. Such methods are referred to as in- 
complete or fractional factorial designs. The Latin squares 
is one such special design meant for problems: (i) involving 
three factors or more, (ii) that allows blocking in two direc- 
tions, i.e., eliminating two sources of nuisance variability, 
(iii) where the number of levels for each factor is the same, 
and (iv) where interaction terms among factors are negligible
(i.e., the interaction terms $(\alpha\beta)_{ij}$ in the statistical effects
model given by Eq. 6.2 are dropped). This allows a large reduction
in the number of experimental runs especially when
several levels are involved.
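
The growth in the number of runs given by Eq. 6.6 can be verified by direct enumeration, as in the following sketch. The function names are hypothetical and serve only to illustrate the count.

```python
from itertools import product
from math import prod


def full_factorial_runs(levels_per_factor):
    """Eq. 6.6: product of the number of levels over all k factors."""
    return prod(levels_per_factor)


def enumerate_full_factorial(levels_per_factor):
    """List every treatment combination as a tuple of 1-based levels."""
    return list(product(*[range(1, n + 1) for n in levels_per_factor]))


three_factors = [3, 3, 3]   # k = 3 factors with n = 3 levels each
runs = enumerate_full_factorial(three_factors)

print("full factorial runs:", full_factorial_runs(three_factors), len(runs))
print("Latin square runs for n = 3:", 3 ** 2)
print("first few combinations:", runs[:4])
```

For three factors at three levels each, the enumeration yields the $3^3$ combinations of the full factorial, against the $3^2$ cells of a Latin square.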

A Latin Square for n levels denoted by (n x n) is a square
of n rows and n columns with each of the $n^2$ cells containing
one specific treatment that appears once, and only once, in
each row and column. Consider a three factor experiment
at three different levels each. The number of experiments
required for full factorial, i.e., to map out the entire experimental
space would be $3^3 = 27$. For incomplete factorials,
the number of experiments reduces to $3^2$ or 9 experiments.
The (3 x 3) Latin square design with three factors (A, B, C)
is shown in Table 6.8. While the levels of A and B are laid down as
rows and columns, the level of the third factor is displayed
in each cell. Thus, the first cell requires a test with all three
factors set at level 1, while the last cell requires A and B to be
set at level 3 and C at level 2. A simple manner of generating
Latin Square designs for higher values of n is to write
the levels in order in the first row, with the subsequent
rows generated by shifting the sequence of levels one
space to the left.
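
The shifting rule just described is easily coded. The sketch below (function names are illustrative) generates the cyclic square for any n and checks the defining property that each level appears once, and only once, in each row and column.

```python
def cyclic_latin_square(n):
    """Build an (n x n) Latin square by shifting the first row one place left per row."""
    first_row = list(range(1, n + 1))
    return [first_row[shift:] + first_row[:shift] for shift in range(n)]


def is_latin_square(square):
    """Check that every level occurs exactly once in each row and each column."""
    n = len(square)
    levels = set(range(1, n + 1))
    rows_ok = all(set(row) == levels for row in square)
    cols_ok = all({square[i][j] for i in range(n)} == levels for j in range(n))
    return rows_ok and cols_ok


square = cyclic_latin_square(3)
for row in square:
    print(row)
print("valid Latin square:", is_latin_square(square))
```

This yields only one of the possible squares; in practice one square would be selected at random from the set of valid squares, for instance by randomly permuting the rows and columns of the cyclic one.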

Note that the Latin Square design shown in Table 6.8 is
not unique. There are 12 different (3 x 3)
Latin squares for the three level case, but each design only
needs 9 as against 27 experiments required for the full factorial
design. Thus, Latin square designs reduce the required
number of experiments from $n^3$ to $n^2$ (where n is the number
of levels), thereby saving cost and time. In general, the fractional
factorial design requires $n^{k-1}$ experiments, while the
full factorial requires $n^k$.

Table 6.9 assembles the analysis of variance equations for 
a Latin Square design, which will be illustrated in Example 
6.2.3. It is said (Montgomery 2009) that Latin Square designs 
usually have a small number of error degrees of freedom (for 

Table 6.8 A (3 x 3) Latin Square Design with three factors (A, B, C) 
with 3 levels each denoted by (1, 2, 3) 



Table 6.9 The analysis of variance equations for (n x n) Latin square
design

Source of    Sum of     Degrees of    Mean square                 Computed F
variation    squares    freedom                                   statistic
Row          SSR        n-1           MSR = SSR/(n-1)             F_R = MSR/MSE
Column       SSC        n-1           MSC = SSC/(n-1)             F_C = MSC/MSE
Treatment    SSTr       n-1           MSTr = SSTr/(n-1)           F_Tr = MSTr/MSE
Error        SSE        (n-1)(n-2)    MSE = SSE/[(n-1)(n-2)]      -
Total        SST        n^2-1         -                           -

example, 2 for a 3 x 3 and 6 for a 4 x 4 design), which allows 
a measure of model variance to be deduced. 

In conclusion, while the randomized block design allows 
blocking of one source of variation, the Latin square design 
allows systematic blocking of two sources of variability 
for problems involving three or more factors or variables. 
The concept, under the same assumptions as those for Latin 
Square design, can be extended to problems with four factors 
or more where three sources of variability need to be blocked;
this is done using Graeco-Latin square designs (see
Box et al. 1978; Montgomery 2009).

Example 6.2.3: Evaluating impact of three factors (school, 
air filter type and season) on breathing complaints 
In an effort to reduce breathing related complaints from stu- 
dents, four different types of air cleaning filters (labeled A, 
B, C and D, which are viewed as treatments in DOE termin- 
ology) are being considered for all schools in a district. Since 
seasonal effects are important, tests are to be performed un- 
der each of the four seasons (and correct for the days when 
the school is in session for each of these seasons). Further, it 
is decided that tests should be conducted in four schools (la- 
beled 1 through 4). Because of the potential for differences 
between schools, it is logical to insist that each filter type be 
tested at each school during each season of the year. 

(a) Develop a DOE. This is a three factor problem with four
levels in each, i.e., (4 x 4). The total number of treatment
combinations for a completely randomized design would be
$4^3 = 64$. The selection of the same number of categories for
all three criteria of classification could be done following a
Latin square design, and the analysis of variance performed
using the results of only 16 treatment combinations. A typical
Latin square, selected at random from all possible (4 x 4)
squares, is given in Table 6.10.



Table 6.8 (cell entries give the level of factor C)

            Factor B
Factor A    1    2    3
1           1    2    3
2           2    3    1
3           3    1    2



Table 6.10 Experimental design

           Season
School     Fall    Winter    Spring    Summer
1          A       B         C         D
2          D       A         B         C
3          C       D         A         B
4          B       C         D         A
Table 6.11 Data table showing number of breathing complaints. A, B,
C and D are four different types of air filters being evaluated.

School     Fall      Winter    Spring    Summer    Average
1          A: 70     B: 75     C: 68     D: 81     73.50
2          D: 66     A: 59     B: 55     C: 63     60.75
3          C: 59     D: 66     A: 39     B: 42     51.50
4          B: 41     C: 57     D: 39     A: 55     48.00
Average    59.00     64.25     50.25     60.25     58.4375



Table 6.12 ANOVA results following equations shown in Table 6.9



The rows and columns represent the two sources of va- 
riation one wishes to control. One notes that in this design, 
each treatment occurs exactly once in each row and in each 
column. Such a balanced arrangement allows the effect of 
the air cleaning filter to be separated from that of the sea- 
son variable. Note that if interaction between the sources of 
variation is present, the Latin square model cannot be used; 
this assessment ought to be made based on previous studies 
or expert opinion. 

(b) Perform an ANOVA analysis Table 6.11 summarizes 
the data collected under such an experimental protocol, whe- 
re the numerical values shown are the number of breathing- 
related complaints per season corrected for the number of 
days when the school is in session and for changes in number 
of student population. 

Assuming that the various sources of variation do not in- 
teract, the objective is to statistically determine whether any 
(and, if so, which) of the three factors (school, season and 
filter type) affect the number of breathing complaints. 

Generating scatter plots such as that shown in Fig. 6.8 for
filter type is a good start. In this case, one would make a fair
guess, based on the between-group and within-group variation,
that filter type is probably not an influential factor on the
number of complaints. The averages of the four treatments or
filter types are:

A = 55.75, B = 53.25, C = 61.75, D = 63.00

The standard deviation is also determined as stdev = 12.91.
The analysis of variance approach is likely to be more convincing
because of its statistical rigor.

Fig. 6.8 Scatter plot of filter type on number of complaints suggests a
lack of correlation. This is supported by the ANOVA analysis.

Table 6.12

Source of      Sum of      Degrees of    Mean      Computed       Probability
variation      squares     freedom       square    F statistic
School         1557.2      3             519.06    11.92          0.006
Season         417.69      3             139.23    3.20           0.105
Filter type    263.69      3             87.90     2.02           0.213
Error          261.37      6             43.56     -              -
Total          2499.94     15            -         -              -

From the probability values in the last column of Table 6.12, 
it can be concluded that the number of complaints is strongly 
dependent on the school variable, statistically significant at 
the 0.10 level on the season, and not statistically significant 
on filter type. ■ 
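As a quick hand check of the ANOVA table discussed above, each computed F statistic is simply the mean square of the factor divided by the mean square of the error term. The worked arithmetic below uses the tabulated values as printed (rounded), so the last digit may differ slightly:

```latex
\begin{aligned}
MS_{\text{error}} &= \frac{261.37}{6} \approx 43.56 \\
F_{\text{school}} &= \frac{1557.2/3}{43.56} = \frac{519.06}{43.56} \approx 11.92 \\
F_{\text{season}} &= \frac{417.69/3}{43.56} = \frac{139.23}{43.56} \approx 3.20 \\
F_{\text{filter}} &= \frac{263.69/3}{43.56} = \frac{87.90}{43.56} \approx 2.02
\end{aligned}
```

Each ratio is then compared against the F distribution with 3 and 6 degrees of freedom to obtain the probability in the last column.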



6.3 Factorial Designs 

6.3.1 2^k Factorial Designs

The above treatment of full and incomplete factorial designs can lead
to a prohibitive number of runs when numerous levels need to be
considered. As pointed out by Box et al. (1978), it is wise to design
a DOE investigation in stages, with each successive iteration
providing incremental insight into important issues and suggesting
subsequent investigations. Factorial designs, primarily 2^k and 3^k,
are of great value at the early stages of an investigation, where a
large number of possible factors are investigated with the intention
of either narrowing down the number (as in screening design) or
getting a preliminary understanding of the mathematical relationship
between factors and the response variable. These are, thus, viewed as
a logical lead-in to the response surface method discussed in
Sect. 6.4. The associated mathematics and interpretation of 2^k
designs are simple, and can provide insights into the framing of more
sophisticated and complete experimental designs. They are used
extensively in R&D of products and processes.

The 2^k factorial design derives its terminology from the fact that
only two levels are presumed for each of the k factors, one
indicative of the lower level or range of variation (coded as -) and
the other representing the higher level (coded as +). The factors can
be categorical or continuous; if the latter, they are discretized
into categories or levels. Figure 6.9a illustrates how the continuous
regressors x_1 and x_2 are discretized, depending on their range of
variation, into four system states, while Fig. 6.9b depicts how these
four observations would appear in a scatter plot should they exhibit
no factor interaction (that is why the lines are parallel). For two
factors, the number of trials (without any replication) would be
2^2 = 4; for three factors, this would be 2^3 = 8, and so on. The
formalism of coding the low and high levels of the factors as -1 and
+1 respectively is most widespread, though other ways of coding
variables have been proposed.

Fig. 6.9 Illustration of how models are built from factorial design
data. A 2^2 factorial design is assumed. a Discretize the range of
variation of the regressors x_1 and x_2 into "low" and "high" ranges,
and b regress the system performance data as they appear on a scatter
plot (no factor interaction is assumed since the two lines are shown
parallel)



Table 6.13 The standard form (suggested by Yates) for setting up the
two-level three-factor (or 2^3) design

         Level of Factors
Trial    A    B    C    Response
1        -    -    -    y_1
2        +    -    -    y_2
3        -    +    -    y_3
4        +    +    -    y_4
5        -    -    +    y_5
6        +    -    +    y_6
7        -    +    +    y_7
8        +    +    +    y_8



Table 6.13 depicts a quick and easy way of setting up a two-level
three-factor design (following the standard form suggested by Yates).
Notice that the last but one column has four (-) followed by four
(+), the last but two column has successive pairs of (-) and (+), and
the second has alternating (-) and (+). The Yates algorithm is easily
extended to a higher number of factors. However, the sequence in
which the runs are to be performed should be randomized; a good way
is to simply sample the set of trials {1, ..., 8} in random fashion
without replacement.
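The construction described above can be sketched in a few lines of code. This is only an illustrative sketch using the Python standard library; the function name `yates_standard_order` and the fixed random seed are assumptions made here for repeatability and are not part of the original text.

```python
import itertools
import random


def yates_standard_order(k):
    """Return the 2^k runs in standard (Yates) order as tuples of -1/+1.

    The first factor alternates fastest and the last factor slowest.
    """
    runs = []
    # itertools.product varies its LAST element fastest, so each tuple is
    # reversed to make the FIRST factor the fastest-changing one.
    for levels in itertools.product((-1, +1), repeat=k):
        runs.append(tuple(reversed(levels)))
    return runs


design = yates_standard_order(3)
for trial, (a, b, c) in enumerate(design, start=1):
    print(trial, a, b, c)

# Randomize the execution sequence by sampling the trial numbers
# without replacement (seed fixed only so that the sketch is repeatable).
rng = random.Random(0)
run_order = rng.sample(range(1, len(design) + 1), k=len(design))
print("run order:", run_order)
```

The same function extends to more factors by changing `k`; only the order in which the trials are executed is randomized, not the table itself.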

The approach can be modified to treat the case of parameter
interaction, i.e., when the factors interact. Table 6.13 is simply
modified by including separate columns for the interaction terms, as
shown in Table 6.14. The sign coding for the interactions is
determined by multiplying the signs of each of the corresponding
terms. For example, interaction AB for trial 1 would be
(-)(-) = (+); and so on.

The main effect of, say, factor C can be determined simply as:

$$\text{Main effect of } C = \bar{C}_{+} - \bar{C}_{-} = \frac{(y_5 + y_6 + y_7 + y_8)}{4} - \frac{(y_1 + y_2 + y_3 + y_4)}{4} \qquad (6.7)$$



Statistical text books on DOE provide elaborate details of 
how to obtain estimates of all main and interaction effects 



Table 6.14 The standard form of the two-level three-factor (or 2^3)
design with interactions

         Level of Factors    Interactions
Trial    A    B    C         AB   AC   BC   ABC    Response
1        -    -    -         +    +    +    -      y_1
2        +    -    -         -    -    +    +      y_2
3        -    +    -         -    +    -    +      y_3
4        +    +    -         +    -    -    -      y_4
5        -    -    +         +    -    -    +      y_5
6        +    -    +         -    +    -    -      y_6
7        -    +    +         -    -    +    -      y_7
8        +    +    +         +    +    +    +      y_8
Table 6.15 Response table for the 2^3 design with interactions
(omitting the ABC term). A dot marks an empty cell.

Trial  Resp.  A+   A-   B+   B-   C+   C-   AB+  AB-  AC+  AC-  BC+  BC-
1      y_1    .    y_1  .    y_1  .    y_1  y_1  .    y_1  .    y_1  .
2      y_2    y_2  .    .    y_2  .    y_2  .    y_2  .    y_2  y_2  .
3      y_3    .    y_3  y_3  .    .    y_3  .    y_3  y_3  .    .    y_3
4      y_4    y_4  .    y_4  .    .    y_4  y_4  .    .    y_4  .    y_4
5      y_5    .    y_5  .    y_5  y_5  .    y_5  .    .    y_5  .    y_5
6      y_6    y_6  .    .    y_6  y_6  .    .    y_6  y_6  .    .    y_6
7      y_7    .    y_7  y_7  .    y_7  .    .    y_7  .    y_7  y_7  .
8      y_8    y_8  .    y_8  .    y_8  .    y_8  .    y_8  .    y_8  .
Sum    (column sums)
No.    8      4    4    4    4    4    4    4    4    4    4    4    4
Avg    (column sum divided by No., giving the averages of A+, A-, B+,
       B-, C+, C-, AB+, AB-, AC+, AC-, BC+ and BC-)
Effect (average of the + column minus average of the - column for
       each of A, B, C, AB, AC and BC)



when more factors are to be considered, and then how to use
statistical procedures such as ANOVA to identify the significant
ones. The standard form shown in Table 6.14 can be rewritten as shown
in Table 6.15 for the 2^3 design with interactions. This is referred
to as the response table form, and is advantageous in that it allows
the analysis to be done in a clear and modular manner. The
interpretation of what is implied by the interaction terms appearing
in the table needs to be clarified. For example, AB_+ denotes the
effect of A when B is held fixed at the B_+ or higher level. On the
other hand, AB_- denotes the effect of A when B is held fixed at the
B_- or lower level.

For example, the main effect of A is conveniently determined as:

$$(\bar{A}_{+} - \bar{A}_{-}) = \frac{1}{4}\left[(y_2 + y_4 + y_6 + y_8) - (y_1 + y_3 + y_5 + y_7)\right] \qquad (6.8a)$$

while the interaction effect of, say, BC can be determined as half
the difference between the B effect when C is held constant at +1 and
the B effect when C is held constant at -1. Thus:

$$\text{Interaction effect of } BC = (\overline{BC}_{+} - \overline{BC}_{-}) = \frac{1}{4}\left[(y_1 + y_2 + y_7 + y_8) - (y_3 + y_4 + y_5 + y_6)\right] \qquad (6.8b)$$
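To see where the grouping of responses in Eq. 6.8b comes from, the short derivation below writes out the B effect separately at each level of C, using the trial numbering of the standard form in Table 6.14. The symbols $E_{B|C+}$ and $E_{B|C-}$ are introduced here only for illustration:

```latex
\begin{aligned}
E_{B|C+} &= \tfrac{1}{2}(y_7 + y_8) - \tfrac{1}{2}(y_5 + y_6) \\
E_{B|C-} &= \tfrac{1}{2}(y_3 + y_4) - \tfrac{1}{2}(y_1 + y_2) \\
\tfrac{1}{2}\left(E_{B|C+} - E_{B|C-}\right)
  &= \tfrac{1}{4}\left[(y_1 + y_2 + y_7 + y_8) - (y_3 + y_4 + y_5 + y_6)\right]
\end{aligned}
```

The responses carrying a plus sign are exactly the trials where the BC column of Table 6.14 is (+), which is why the sign-multiplication rule gives the same result.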

These individual and interaction effects directly provide a
prediction model of the form:

$$y = b_0 + \underbrace{b_1 A + b_2 B + b_3 C}_{\text{Main effects}} + \underbrace{b_{12} AB + b_{13} AC + b_{23} BC + b_{123} ABC}_{\text{Interaction terms}} \qquad (6.9a)$$

The intercept term is given by the grand average of all the response
values y. This model is analogous to Eq. 5.18c, which is one form of
the additive multiple linear models discussed in Chap. 5. Note that
Eq. 6.9a has eight parameters, and with eight experimental runs the
model fit will be perfect with no variance. A measure of the random
error can only be deduced if the degrees of freedom (d.f.) > 0, and
so replication (i.e., repeats of runs) is necessary. Another option,
relevant when interaction effects are known to be negligible, is to
adopt a model with only main effects such as:

$$y = b_0 + b_1 A + b_2 B + b_3 C \qquad (6.9b)$$

In this case, d.f. = 4, and so a measure of random error of the
model can be determined.
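The degrees-of-freedom bookkeeping behind these statements is a simple count of observations minus fitted parameters. Writing $n$ for the number of runs and $p$ for the number of model parameters (symbols introduced here for illustration only):

```latex
\begin{aligned}
\text{Full model (Eq. 6.9a), no replication:} \quad & \text{d.f.} = n - p = 8 - 8 = 0 \\
\text{Main effects only (Eq. 6.9b):} \quad & \text{d.f.} = n - p = 8 - 4 = 4 \\
\text{Full model with two replicates per trial:} \quad & \text{d.f.} = n - p = 16 - 8 = 8
\end{aligned}
```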

Example 6.3.1: Deducing a prediction model for a 2^3 factorial design

Consider a problem where three factors {A, B, C} are presumed to
influence a response variable y. The problem is to perform a DOE,
collect data, ascertain the statistical importance of the factors and
then identify a prediction model. The numerical values of the factors
or regressors corresponding to the high and low levels are assembled
in Table 6.16 (while the numerical value of A is a fraction about its
mean value, those of B and C are not).

It was decided to use two replicate tests for each of the 8
combinations to enhance accuracy. Thus 16 runs were performed and the
results are tabulated in the standard form as suggested by Yates
(Table 6.13) and shown in Table 6.17.

Table 6.16 Assumed low and high levels for the three factors
(Example 6.3.1)

Factor    Low level    High level
A         0.9          1.1
B         1.20         1.30
C         20           30

Table 6.17 Standard table (Example 6.3.1)

         Level of Factors
Trial    A      B      C     Responses
1        0.9    1.2    20    34, 40
2        1.1    1.2    20    26, 29
3        0.9    1.3    20    33, 35
4        1.1    1.3    20    21, 22
5        0.9    1.2    30    24, 23
6        1.1    1.2    30    23, 22
7        0.9    1.3    30    19, 18
8        1.1    1.3    30    18, 18

(a) Identify statistically significant terms
This tabular data can be used to create a table similar to
Table 6.15, which is left to the reader. Then, the main effects and
interaction terms can be calculated following Eq. 6.8. Thus:
Main effect of factor A:

$$\frac{1}{(2 \times 4)}\left[(26 + 29) + (21 + 22) + (23 + 22) + (18 + 18) - (34 + 40) - (33 + 35) - (24 + 23) - (19 + 18)\right] = \frac{-47}{8} = -5.875$$

while the effect sum of squares SS_A = (-47.0)^2/16 = 138.063

Similarly, B: -4.625; C: -9.375; AB: -0.625; AC: 5.125; BC: -0.125;
ABC: 0.875. The results of the ANOVA analysis are assembled in
Table 6.18. One concludes that the main effects A, B and C and the
interaction effect AC are significant at the 0.01 level. The main
effect and interaction effect plots are shown in Figs. 6.10 and 6.11.
These plots do confirm that interaction effects are present only for
factors A and C since the lines are clearly not parallel.

Table 6.18 Results of the ANOVA analysis. Interaction effects AB, BC
and ABC are not significant (Example 6.3.1)

Source               Sum of      D.f.   Mean       F-Ratio   p-Value
                     Squares            Square
Main effects
  Factor A           138.063     1      138.063    41.68     0.0002
  Factor B           85.5625     1      85.5625    25.83     0.0010
  Factor C           351.563     1      351.563    106.13    0.0000
Interactions
  AB                 1.5625      1      1.5625     0.47      0.5116
  AC                 105.063     1      105.063    31.72     0.0005
  BC                 0.0625      1      0.0625     0.02      0.8941
  ABC                3.063       1      3.063      0.92      0.3640
Residual or error    26.5        8      3.3125
Total (corrected)    711.438     15

(b) Identify prediction model

Only four terms, namely A, B, C and the AC interaction, were found to
be statistically significant at the 0.05 level (see Table 6.18). In
such a case, the functional form of the prediction model given by
Eq. 6.9a reduces to:

$$y = b_0 + b_1 x_A + b_2 x_B + b_3 x_C + b_4 x_A x_C$$

Substituting the values of the effect estimates determined earlier
results in

$$y = 25.313 - 2.938 x_A - 2.313 x_B - 4.688 x_C + 2.563 x_A x_C \qquad (6.10)$$

where coefficient b_0 is the mean of all observations. Also, note
that the values of the model coefficients are half the values of the
main and interaction effects determined in part (a). For example, the
main effect of factor A was calculated to be (-5.875), which is twice
the (-2.938) coefficient for the x_A factor shown in the equation
above. The division by 2 is needed because of the manner in which the
factors were coded, i.e., the high and low levels, coded as +1 and
-1, are separated by 2 units.

The performance equation thus determined can be used for predictions.
For example, when x_A = +1, x_B = -1, x_C = -1, one gets y = 26.813,
which agrees reasonably well with the average of the two replicates
performed (26 and 29).
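The calculations of parts (a) and (b) can be reproduced with a short script. The sketch below is illustrative only: it uses the replicate responses of Table 6.17 and the sign-multiplication rule for interactions, while the helper names `sign` and `predict` are made up for this example and do not come from the text.

```python
# Replicate responses from Table 6.17, listed in standard (Yates) order
responses = [(34, 40), (26, 29), (33, 35), (21, 22),
             (24, 23), (23, 22), (19, 18), (18, 18)]

# Coded levels (A, B, C) for trials 1..8; A alternates fastest
levels = [(a, b, c) for c in (-1, 1) for b in (-1, 1) for a in (-1, 1)]


def sign(term, run):
    """Sign of a main-effect or interaction column, e.g. term='AC'."""
    a, b, c = run
    s = 1
    for name in term:
        s *= {"A": a, "B": b, "C": c}[name]
    return s


n_obs = sum(len(r) for r in responses)            # 16 observations
grand_mean = sum(sum(r) for r in responses) / n_obs

effects, sum_sq = {}, {}
for term in ("A", "B", "C", "AB", "AC", "BC", "ABC"):
    contrast = sum(sign(term, run) * sum(r)
                   for run, r in zip(levels, responses))
    effects[term] = contrast / (n_obs / 2)        # avg(+) minus avg(-)
    sum_sq[term] = contrast ** 2 / n_obs          # effect sum of squares

# Pure-error sum of squares: scatter of replicates about each cell mean
ss_error = sum((y - sum(r) / len(r)) ** 2 for r in responses for y in r)
ms_error = ss_error / (n_obs - len(responses))    # 8 degrees of freedom

for term in effects:
    print(term, effects[term], sum_sq[term], round(sum_sq[term] / ms_error, 2))


def predict(xa, xb, xc):
    """Reduced model of Eq. 6.10; coefficients are half the effects."""
    return (grand_mean + effects["A"] / 2 * xa + effects["B"] / 2 * xb
            + effects["C"] / 2 * xc + effects["AC"] / 2 * xa * xc)


print(predict(+1, -1, -1))
```

The p-values of Table 6.18 would additionally require the F distribution with 1 and 8 degrees of freedom, which is not part of the Python standard library and is therefore left out of this sketch.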

(c) Comparison with linear multiple regression approach
The parallel between this approach and regression modeling involving
indicator variables (described in Sect. 5.7.3) is obvious but
noteworthy. For example, if one were to perform a multiple regression
on the above data with the three regressors coded as -1 and +1 for
low and high values respectively, one obtains the results shown in
Table 6.19.

Note that the same four variables (A, B, C and the AC interaction)
are statistically significant, while the model coefficients are
identical to the ones determined by the ANOVA analysis. If the
regression were to be redone with only these four variables present,
the model coefficients would be identical. This is a great advantage
with factorial designs in that one could include additional variables
incrementally in the model without impacting the model coefficients
of variables already identified. Why this is so is explained in the
next section. Table 6.20 assembles pertinent goodness-of-fit indices
for the complete model and the one with only the four significant
regressors. Note that while the R^2 value of the former is higher (a
misleading statistic to consider when dealing with multivariate model
building), the adjusted R^2 and the RMSE of the reduced model are
superior. Finally, Figs. 6.12 and 6.13 show the observed versus
model-predicted values and the model residuals versus predicted
values respectively, which allow one to ascertain how well the model
has fared; in this case, there seems to be larger scatter at higher
values, indicative of non-additive errors. This suggests that a
linear additive model may not be the best. ■
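The goodness-of-fit indices quoted for the two models can be cross-checked from the ANOVA sums of squares alone, because in an orthogonal design the sum of squares of any dropped term simply moves into the residual. The sketch below assumes the unrounded sums of squares (contrast squared divided by 16) rather than the rounded entries printed in Table 6.18; the function name `fit_stats` is hypothetical.

```python
import math

# Unrounded effect sums of squares for Example 6.3.1 (contrast^2 / 16)
ss = {"A": 138.0625, "B": 85.5625, "C": 351.5625,
      "AB": 1.5625, "AC": 105.0625, "BC": 0.0625, "ABC": 3.0625}
ss_error = 26.5                      # pure error from the replicates
n = 16                               # total number of observations
ss_total = sum(ss.values()) + ss_error


def fit_stats(kept):
    """Return (R^2, adjusted R^2, RMSE) for a model keeping `kept` terms."""
    ss_res = ss_total - sum(ss[t] for t in kept)   # dropped terms pool here
    df_res = n - 1 - len(kept)                     # intercept uses one d.f.
    r2 = 1 - ss_res / ss_total
    adj_r2 = 1 - (ss_res / df_res) / (ss_total / (n - 1))
    rmse = math.sqrt(ss_res / df_res)
    return r2, adj_r2, rmse


full = fit_stats(list(ss))
reduced = fit_stats(["A", "B", "C", "AC"])
print("all terms    :", [round(v, 3) for v in full])
print("four terms   :", [round(v, 3) for v in reduced])
```

Under these assumptions the script returns values that round to those listed in Table 6.20, which illustrates why the reduced model has a lower R^2 but a better adjusted R^2 and RMSE.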
Fig. 6.10 Main effect scatter plots for the three factors

Fig. 6.11 Interaction plots
6.3.2 Concept of Orthogonality

An important concept in DOE is orthogonality, by which is implied
that trials should be framed such that the data matrix X (refer to
Sect. 5.4.3 for a refresher) results in a matrix (X'X) that is
proportional to the identity matrix I. (Recall from basic geometry
that two straight lines are perpendicular when the product of their
slopes is equal to -1; orthogonality is an extension of this concept
to multiple dimensions.) In such a case, the off-diagonal terms of
the matrix (X'X) will be zero, i.e., the regressors are uncorrelated.
This would lead to the best designs since it would minimize the
variance of the regression coefficients. For example, consider
Table 6.13 where the standard form for the two-level three-factor
design is shown. Replacing low and high values (i.e., - and +) by -1
and +1, and noting that an extra column of 1 needs to be introduced
to take care of the constant term in the model (see Eq. 5.26b),
results in the regressor matrix being defined by:
Table 6.19 Results of performing a multiple linear regression on the
same data with coded regressors (Example 6.3.1)

Parameter                      Parameter   Standard    t-statistic   p-value
                               estimate    error
Constant                       25.3125     0.455007    55.631        0.0000
Factor A                       -2.9375     0.455007    -6.45595      0.0002
Factor B                       -2.3125     0.455007    -5.08234      0.0010
Factor C                       -4.6875     0.455007    -10.302       0.0000
Factor A*Factor B              -0.3125     0.455007    -0.686803     0.5116
Factor A*Factor C              2.5625      0.455007    5.63178       0.0005
Factor B*Factor C              -0.0625     0.455007    -0.137361     0.8941
Factor A*Factor B*Factor C     0.4375      0.455007    0.961524      0.3644



Table 6.20 Goodness-of-fit statistics of multiple linear regression
models (Example 6.3.1)

Regression model                    Model R^2   Adjusted R^2   RMSE
With all terms                      0.963       0.930          1.820
With only four significant terms    0.956       0.940          1.684



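$$X = \begin{bmatrix}
1 & -1 & -1 & -1 \\
1 & +1 & -1 & -1 \\
1 & -1 & +1 & -1 \\
1 & +1 & +1 & -1 \\
1 & -1 & -1 & +1 \\
1 & +1 & -1 & +1 \\
1 & -1 & +1 & +1 \\
1 & +1 & +1 & +1
\end{bmatrix} \qquad (6.11)$$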



The reader can verify that the off-diagonal terms of the matrix
(X'X) are indeed zero. All n^k factorial designs are thus orthogonal,
i.e., (X'X)^-1 is a diagonal matrix with nonzero diagonal components.
This leads to the soundest parameter estimation (as discussed in
Sect. 11.2). Another benefit of orthogonal designs is that parameters
of regressors already identified remain unchanged as additional
regressors are added to the model, thereby allowing the model to be
developed incrementally. Thus, the effect of each term of the model
can be examined independently. These are two great benefits when
factorial designs are adopted for model identification.
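The verification left to the reader above takes only a few lines. The sketch below builds the coded regressor matrix for the two-level three-factor design (a column of ones followed by the A, B and C columns in standard order) and forms X'X with plain Python lists; the variable names are illustrative choices, not notation from the text.

```python
import itertools

# Rows of X: leading 1 for the intercept, then coded levels of A, B, C
# in standard (Yates) order, with A alternating fastest.
rows = [(1,) + tuple(reversed(levels))
        for levels in itertools.product((-1, 1), repeat=3)]

n_cols = len(rows[0])

# (X'X)[i][j] is the dot product of column i with column j
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n_cols)]
       for i in range(n_cols)]

for line in xtx:
    print(line)

# Orthogonality: every off-diagonal entry is zero
off_diagonal_zero = all(xtx[i][j] == 0
                        for i in range(n_cols)
                        for j in range(n_cols) if i != j)
print("orthogonal:", off_diagonal_zero)
```

For this design every diagonal entry equals the number of runs (8), so the inverse of X'X is simply the identity matrix divided by 8.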
Fig. 6.12 Observed versus predicted values for the regression model
indicate larger scatter at high values (Example 6.3.1)

Fig. 6.13 Model residuals versus model predicted values highlight the
larger scatter present at higher values indicative of non-additive
errors (Example 6.3.1)



Example 6.3.2:* Matrix approach to inferring a prediction model for
a 2^3 design

This example will illustrate the analysis procedure for a complete
2^k factorial design with three factors similar to Example 6.3.1 but
following the matrix formulation. The model, assuming a linear form,
is given by Eq. 6.9a and includes individual or main and interaction
effects. Denoting the three factors by x_1, x_2 and x_3, the
regressor matrix X will have four main-effect columns (the intercept
term being the first, additional column) as well as four interaction
columns, as shown below (refer to Table 6.14 for ease in
understanding the matrix):



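$$X = \left[\begin{array}{rrrr|rrrr}
1 & -1 & -1 & -1 & 1 & 1 & 1 & -1 \\
1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 \\
1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 \\
1 & 1 & 1 & -1 & 1 & -1 & -1 & -1 \\
1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\
1 & 1 & -1 & 1 & -1 & 1 & -1 & -1 \\
1 & -1 & 1 & 1 & -1 & -1 & 1 & -1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{array}\right]$$

(first four columns: main effects; last four columns: interaction
effects)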



Let us assume that a DOE has yielded the following values for the
response variable:

Y^T = [49 62 44 58 42 73 35 69]

The intention is to identify a parsimonious model, i.e., one in which
only the statistically significant terms following Eq. 6.9a are
retained in the model.

* From Beck and Arnold (1977) by permission of Beck.
The inverse of X^T X is (1/8) I, and the X^T Y terms can be deduced
by taking the sums of the y terms multiplied by either +1 or -1 as
indicated in X^T. The coefficient b_0 = 54 (average of all eight
values of y), while b_1 following Eq. 6.8a is:
b_1 = [(62 + 58 + 73 + 69) - (49 + 44 + 42 + 35)]/(4 x 2) = 11.5,
and so on. The resulting model is:

$$\hat{y}_i = 54 + 11.5 x_{1i} - 2.5 x_{2i} + 0.75 x_{3i} + 0.5 x_{1i} x_{2i} + 4.75 x_{1i} x_{3i} - 0.25 x_{2i} x_{3i} + 0.25 x_{1i} x_{2i} x_{3i} \qquad (6.12a)$$

With eight parameters and eight observations (and no replication),
the model will be perfect with zero degrees of freedom; this is
referred to as a saturated model. This is not a prudent situation
since a model variance cannot be computed, nor can the p-values of
the various terms be inferred. Had replication been adopted, the
variance in the model could have been conveniently estimated and some
measure of the goodness-of-fit of the model deduced (as in
Example 6.3.1). In this case, the simplest recourse is to drop one of
the terms from the model (say the (x_1 x_2 x_3) interaction term) and
then perform the ANOVA analysis. Because of the orthogonal behavior,
the significance of the dropped term can be evaluated at a later
stage without affecting the model terms already identified.

The effect of individual terms is now investigated in a manner
similar to the previous example. The ANOVA analysis shown in
Table 6.21 suggests that only the terms x_1 and (x_1 x_3) are
statistically significant at the 0.05 level. However, the p-value for
x_2 is close, and so it would be advisable to keep this term. Thus,
the parsimonious model assumes the form:

$$\hat{y}_i = 54 + 11.5 x_{1i} - 2.5 x_{2i} + 4.75 x_{1i} x_{3i} \qquad (6.12b)$$

The above example illustrates how data gathered within a DOE
framework and analyzed following the ANOVA method can yield an
efficient functional predictive model of the data. It is left to the
reader to repeat the analysis illustrated in Example 6.3.1 where an
identical model was obtained by straightforward use of multiple
linear regression. Note that orthogonality is maintained only if the
analysis is done with coded variables (-1 and +1), and not with the
original ones.

Table 6.21 Results of the ANOVA analysis (Example 6.3.2)

Source               Sum of     D.f.   Mean      F-ratio   p-value
                     squares           square
Main effects
  Factor x_1         1058       1      1058      2116      0.0138
  Factor x_2         50.0       1      50.0      100       0.0635
  Factor x_3         4.50       1      4.50      9.00      0.2050
Interactions
  x_1 x_2            2.00       1      2.00      4.00      0.2950
  x_1 x_3            180.5      1      180.5     361       0.0335
  x_2 x_3            0.50       1      0.50      1.00      0.5000
Residual or error    0.50       1      0.50
Total (Corrected)    1296       7

[Figure: two panels, "Fractional factorial" and "Full factorial"]

Fig. 6.14 Illustration of the differences between fractional and full
factorial runs for a 2^3 DOE experiment. See Table 6.14 for a
specification of the full factorial design. Several different
combinations of fractional factorial designs are possible; only one
such combination is shown
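The matrix calculation of Example 6.3.2 can be sketched without any linear-algebra library, because the inverse of X^T X for this design is just the identity divided by 8. The code below is an illustrative sketch using the response vector given in the example; the helper name `row` and the term labels are assumptions made here.

```python
# Response vector of Example 6.3.2, in standard (Yates) order
y = [49, 62, 44, 58, 42, 73, 35, 69]

terms = ["1", "x1", "x2", "x3", "x1x2", "x1x3", "x2x3", "x1x2x3"]


def row(a, b, c):
    """One row of X: intercept, main effects, then interaction columns."""
    return [1, a, b, c, a * b, a * c, b * c, a * b * c]


# x1 alternates fastest, x3 slowest (same ordering as Table 6.14)
X = [row(a, b, c) for c in (-1, 1) for b in (-1, 1) for a in (-1, 1)]
n = len(y)

# X^T Y: signed sums of the responses, one per model term
xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(len(terms))]

# Since (X^T X)^-1 = (1/8) I, the coefficients are simply X^T Y / 8
b = [v / n for v in xty]

# Sum of squares attributable to each non-intercept term
ss = [v ** 2 / n for v in xty[1:]]

for name, coef in zip(terms, b):
    print(f"{name:8s} {coef:7.2f}")
print("sums of squares:", dict(zip(terms[1:], ss)))
```

The coefficients reproduce Eq. 6.12a, and the sums of squares match the entries of Table 6.21, with the dropped x1x2x3 term supplying the single residual degree of freedom.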

Recall that a 2^2 factorial design implies two regressors or factors,
each at two levels; say "low" and "high". Since there are only two
states, one can only frame a first order functional model to the data
such as Eq. 6.9. Thus, a 2^k factorial design is inherently
constrained to identifying a first order linear model between the
regressors and the response variable. If the mathematical
relationship requires higher order terms, multi-level factorial
designs are more appropriate. Such designs would allow a model of the
form given by Eq. 5.23 (which is a full second-order polynomial
model). For example, the 3^k design will require the range of
variation of the factors to be aggregated into three levels, such as
"low", "medium" and "high". If the situation is one with three
factors (i.e., k = 3), one needs to perform 27 experiments even if no
replication tests are considered. This is more than three times the
number of tests needed for the 2^3 design. Thus, the added higher
order insight can only be gained at the expense of a larger number of
runs which, for a higher number of factors, may become prohibitive.

One way of greatly reducing the number of runs, provided interaction
effects are known to be negligible, is to adopt incomplete or
fractional factorial designs (described in Sect. 6.2). The 27 tests
needed for a full 3^3 factorial design can be reduced to 9 tests
only. Thus, instead of 3^k tests, an incomplete block design would
only require 3^(k-1) tests. A graphical illustration of how a
fractional factorial design differs from a full factorial one for a
2^3 instance is shown in Fig. 6.14. Three factors are involved (A, B
and C) at two levels each (-1, 1). While 8 test runs are performed
corresponding to each of the 8 corners of the cube for the full
factorial, only 4 runs are required for the fractional factorial as
shown. The interested reader can refer to Box et al. (1978) or
Montgomery (2009) for a detailed treatment of higher level factorial
design methods, both complete and incomplete.
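One common way of picking the 4 runs of a half-fraction is sketched below. It assumes the generator C = AB, which is a standard textbook choice but is an assumption here: the text does not state which fraction is drawn in Fig. 6.14, and other fractions are equally valid.

```python
import itertools

# Full 2^3 design in standard order: (A, B, C) with A alternating fastest
full = [tuple(reversed(levels))
        for levels in itertools.product((-1, 1), repeat=3)]

# Half-fraction: keep only the runs whose C level equals the product A*B
# (assumed generator C = AB; the alternate fraction uses C = -AB)
half = [run for run in full if run[2] == run[0] * run[1]]

print("full factorial runs      :", len(full))
print("fractional factorial runs:", len(half))
for a, b, c in half:
    print(a, b, c)

# Each factor still appears twice at each level, so main effects remain
# estimable; the price is that C cannot be separated from the AB term.
balanced = all(sum(run[i] for run in half) == 0 for i in range(3))
print("balanced:", balanced)
```

In this fraction the C column is identical to the AB column, which is precisely why such a design is only appropriate when interaction effects are known to be negligible.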
6.4 Response Surface Designs

6.4.1 Applications

Recall that the factorial methods described in the previous section
can be applied to either continuous or discrete categorical
variables. The 2^k factorial methods allow both for screening to
identify dominant factors, and for identifying a robust linear
predictive model. Historically, these techniques were then extended
to optimizing a process or product by Box and Wilson in the early
1950s. A special class of mathematical and statistical techniques was
developed to identify models and analyze data between a response and
a set of continuous variables, with the intent of determining the
conditions under which a maximum (or a minimum) of the response
variable is obtained. For example, the optimal mix of two alloys
which would result in the product having maximum strength can be
deduced by fitting the data from factorial experiments with a model
from which the optimum is determined either by calculus or search
methods (described in Chap. 7 under optimization methods). These
models, called response surface models (RSM), can be framed as either
first order or second order models, linear in the parameters,
depending on whether one is far from or close to the desired optimum.
Response surface designs involve not just the modeling aspect, but
also recommendations on how to perform the sequential search
involving several DOE steps.

The reader may wonder why most of the DOE models treated in this
chapter assume empirical polynomial models. This is for historic
reasons: the types of applications which triggered the development of
DOE were not understood well enough to adopt mechanistic functional
forms. Empirical polynomial models are linear in the parameters but
can be non-linear in their functional form (such as Eq. 6.9a). Recall
that a function is strictly linear only when it contains first-order
regressors (i.e., main effects of the factors) with no interacting
terms (such as Eq. 6.9b). Non-linear functional models can arise by
introducing interaction terms in the first order linear model, as
well as by using higher order terms, such as quadratic terms for the
regressors (see Eqs. 5.21 and 5.23).

A second class of problems for which RSM can be used involves
simplifying the search for an optimum when detailed computer
simulation programs of physical systems requiring long run times are
to be used. The similarity of such problems to DOE experiments on
physical processes is easily seen, since the former requires:
(i) performing a sensitivity analysis to determine a subset of
dominant model input parameter combinations (akin to screening),
(ii) defining a suitable approximation for the true functional
relationship between response and the set of independent variables
(akin to determining the number of necessary levels), (iii) making
multiple runs of the computer model using specific values and
pairings of these input parameters (akin to performing factorial
experiments), and (iv) fitting an appropriate mathematical model to
the data. This fitted response surface is then used as a replacement
or proxy for the computer model, and all inferences related to
optimization/uncertainty analysis requiring several thousands of
simulations for the original model are derived from this fitted
model. The validity of this approach is of course contingent on the
computer simulation being an accurate representation of the physical
system. Thus, this application is very similar to the intent behind
process optimization except that model simulations, instead of actual
experiments, are done to predict system response.



6.4.2 Phases Involved 

A typical RS experimental design involves three general phases
performed with the specific intention of limiting the number of
experiments required to achieve a rich data set. This will be
illustrated using the following example. The R&D staff of a steel
company wants to improve the strength of the metal sheets sold. They
have identified a preliminary list of factors that might impact the
strength of their metal sheets, including the concentrations of
chemical A and chemical B, the annealing temperature, the time to
anneal and the thickness of the sheet casting. The first phase is to
run a screening design to identify the main factors influencing the
metal sheet strength. Thus, those factors that are not important
contributors to the metal sheet strength are eliminated from further
study. How to perform such screening tests has been discussed in
Sect. 6.3.1.

As an illustration, it is concluded that the chemical concentrations
A and B are the main factors that survive the screening design. To
optimize the mechanical strength of the metal sheets, one needs to
know the relationship between the strength of the metal sheet and the
concentration of chemicals A and B in the formula; this is done in
the second phase, which requires a sequential search process. The
following steps are undertaken:

(i) Identify the levels of the amount of chemicals A and B to study.
Use 2 levels for linear relationships and 3 or more levels for
non-linear relationships.
(ii) Generate the experimental design using one of several factorial
methods.
(iii) Run the experiments.
(iv) Analyze the data using ANOVA.
(v) Draw conclusions and develop a model for the response variable.
Unfortunately, this is likely to be an approximate model
representative of the behavior of the metal in the local search space
only; and usually, one cannot simply use this model to identify the
global maximum.
(vi) Using optimization methods (such as calculus based methods or
search methods such as steepest descent), move in the direction of
the search space where the overall optimum is likely to lie (refer to
Example 7.4.2).
(vii) Repeat steps (i) through (vi) until the global optimum (here,
the maximum strength) is reached.

Once the optimum has been identified, the R&D staff would want to
confirm that the new, improved metal sheets have higher strength;
this is the third phase. They would resort to hypothesis tests
involving running experiments to support the alternate hypothesis
that the strength of the new, improved metal sheet is greater than
the strength of the existing metal sheet. In summary, the goals of
the second and third phases of the RS design are to determine and
then confirm, with the needed statistical confidence, the optimum
levels of chemicals A and B that maximize the metal sheet strength.



6.4.3 First and Second Order Models 

In most RS problems, the form of the relationship between 
the response and the regressors is unknown. Consider the 
case where the yield (Y) of a chemical process is to be ma- 
ximized with temperature (T) and pressure (P) being the 
two independent variables (from Montgomery 2009). The 
3-D plot (called the response surface in DOE terminology) 
is shown in Fig. 6.15, while its projection of a 2-D plane, 
known as a contour plot, is also shown. The maximum yield 
is achieved under T= 138 and P= 18, at which the maximum 
yield Y= 70. If one did not know the shape of this curve, one 
simple approach would be to assume a starting point (say, 
T=117 and P=20, as shown) and repeatedly perform expe- 
riments in an effort to reach the maximum point. This is akin 
to a univariate optimization search (see Sect. 7.4) which is 
not very efficient. In this example involving a chemical
process, varying one variable at a time may work because of
the symmetry of the RS. However, when the RS is asymmetrical
(and this is often so) or when the search location is far
away from the optimum, such a univariate search may
erroneously indicate a non-optimal maximum. A superior
approach, and the one adopted in most numerical methods, is
the steepest gradient method, which involves adjusting all
the variables together (see Sect. 7.4). As shown in Fig. 6.16,
if the responses Y at each of the four corners of the square
are known by experimentation, a suitable model is identified
(in the figure, a linear model is assumed, and so the lines
for different values of Y are parallel). The steepest
gradient method involves moving along a direction
perpendicular to this set of lines (indicated by the
"steepest descent" direction in the figure) to another point
where the next set of experiments ought to be performed.
Repeated use of this testing, modeling and stepping is likely
to lead one close to the sought-after maximum or minimum
(provided one is not caught in a local peak or valley or a
saddle point).

Fig. 6.15 A three-dimensional response surface between the
response variable (the expected yield) and two regressors
(temperature and pressure) with the associated contour plots
indicating the optimal value. (From Montgomery 2009 by
permission of John Wiley and Sons)

Fig. 6.16 Figure illustrating how the first-order response
surface model (RSM) fit to a local region can progressively
lead to the global optimum using the steepest descent search
method
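The test, model and step cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the response function is hypothetical (a smooth surface peaking at T = 138, P = 18, standing in for physical experiments), and the function names, the half-ranges of the local 2^2 design and the step-halving rule are choices made here, not taken from Montgomery (2009).

```python
# Hypothetical response standing in for real experiments (assumption).
def true_response(t, p):
    return 70.0 - 0.02 * (t - 138.0) ** 2 - 0.5 * (p - 18.0) ** 2

def fit_first_order(center, half_range, response):
    """Run a 2^2 factorial around `center` and return the slopes (b1, b2)
    of the local model y = b0 + b1*x1 + b2*x2 in coded units."""
    runs = [(-1, -1), (1, -1), (-1, 1), (1, 1)]
    ys = [response(center[0] + x1 * half_range[0],
                   center[1] + x2 * half_range[1]) for x1, x2 in runs]
    # Orthogonal design: each slope is a contrast divided by the run count.
    b1 = sum(x1 * y for (x1, _), y in zip(runs, ys)) / 4.0
    b2 = sum(x2 * y for (_, x2), y in zip(runs, ys)) / 4.0
    return b1, b2

def steepest_ascent(start, half_range, response, step=1.0, n_iter=60):
    center = list(start)
    y_best = response(*center)
    for _ in range(n_iter):
        b1, b2 = fit_first_order(center, half_range, response)
        norm = (b1 ** 2 + b2 ** 2) ** 0.5
        if norm < 1e-9:              # locally flat model: nothing to follow
            break
        # Move `step` coded units along the fitted gradient direction.
        trial = [center[0] + step * b1 / norm * half_range[0],
                 center[1] + step * b2 / norm * half_range[1]]
        y_trial = response(*trial)
        if y_trial > y_best:         # accept only if the response improves
            center, y_best = trial, y_trial
        else:                        # overshoot: shorten the step
            step *= 0.5
    return center, y_best

(t_opt, p_opt), y_opt = steepest_ascent((117.0, 20.0), (2.0, 0.5), true_response)
print(round(t_opt, 2), round(p_opt, 2), round(y_opt, 2))
```

Starting from T = 117 and P = 20, the search moves both variables together at every step. In a real study each call to the response function is an experimental run, so far fewer iterations would be affordable.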

The following recommendations are noteworthy so as to 
minimize the number of experiments to be performed: 

(a) During the initial stages of the investigation, a
first-order polynomial model in some region of the range
of variation of the regressors is usually adequate. Such
models have been extensively covered in Chap. 5, with
Eq. 5.25 being the linear first-order model form in
vector notation. $2^k$ factorial designs are good choices
at the preliminary stage of the RS investigation. As
stated earlier, due to the benefit of orthogonality,
these designs are best since they minimize the variance
of the regression coefficients.

(b) Once close to the optimal region, polynomial models
higher than first order are advised. Such a model adds
to the first-order polynomial (involving just main
effects) quadratic effects and interactions between
pairs of factors (two-factor interactions) to account
for curvature. Quadratic models are usually sufficient
for most engineering applications, though increasing
the order of approximation could sometimes further
reduce model errors. Of course, it is unlikely that a
polynomial model will be a reasonable approximation of
the true functional relationship over the entire space
of the independent variables, but for a relatively small
region such models usually work quite well (Montgomery
2009). Note that rarely would all of the terms of the
quadratic model be needed; how to identify a
parsimonious model has been illustrated in Examples
6.3.1 and 6.3.2.



6.4.4 Central Composite Design and the
Concept of Rotation

For a $3^k$ factorial design with the number of factors k = 3,
one needs 27 experiments with no replication, which, for
k = 4, grows to 81 experiments. Thus, the number of trials at
each iteration point increases geometrically. Hence, $3^k$
designs become impractical as k gets much above 3. A more
efficient manner of designing experiments is to use the
concept of rotation, also referred to as axi-symmetry. An
experimental design is said to be rotatable if the trials
are selected such that they are equidistant from the center.
Since the location of the optimum point is unknown, such a
design would result in equal precision of estimation in all
directions. In other words, the variance of the response
variable at any point in the regressor space is a function
of only the distance of the point from the design center.

Several rotatable as well as non-rotatable designs can be 
found in the published literature. Of these, the Central com- 
posite design (CCD) is probably the most widely used for 
fitting a second order response surface. A CCD contains a 
fractional factorial design that is augmented with a group of 
axial points that allow estimation of curvature. Center point
runs (which are essentially random repeats of the center
point) are included to provide a measure of process stability
(which reduces model prediction errors) and to capture any
inherent variability. They also provide a check for curvature:
if the response surface is curved, the center points will
be lower or higher than predicted by the design points. The
factorial or "cube" portion and center points (shown as circ- 
les in Fig. 6.17) may aid in fitting a first-order (linear) model 
during the preliminary stage while still providing evidence 
regarding the importance of a second-order contribution or 
curvature. A CCD always contains twice as many axial (or 
star) points as there are factors in the design. The star points 
represent new extreme values (low and high) for each factor 
in the design. Thus, the total number of experimental runs
for a CCD with k factors is $2^k + 2k + c$, where c is the
number of center points.

Fig. 6.17 A Central Composite Design (CCD) for two factors
contains two sets of trials: a fractional factorial or "cube"
portion which serves as a preliminary stage where one can fit
a first-order (linear) model, and a group of axial or "star"
points that allow estimation of curvature. A CCD always
contains twice as many axial (or star) points as there are
factors in the design. In addition, a certain number of center
points are also used so as to capture inherent random
variability in the process or system behavior
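The saving in experimental effort can be made concrete with a short Python sketch that compares the run count of a $3^k$ full factorial with that of a CCD. The function names are illustrative, a full (not fractional) factorial cube is assumed, and the choice of three center points is arbitrary.

```python
def runs_three_level(k):
    """Runs in a 3^k full factorial without replication."""
    return 3 ** k

def runs_ccd(k, n_center):
    """Runs in a CCD: 2^k cube points + 2k star points + center points."""
    return 2 ** k + 2 * k + n_center

for k in (2, 3, 4, 5):
    print(k, runs_three_level(k), runs_ccd(k, n_center=3))
```

For k = 4 this gives 81 runs for the three-level factorial against 27 for the CCD with three center points.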

CCDs allow for efficient estimation of the quadratic 
terms in the second-order model since they inherently satisfy 
the desirable design properties of orthogonal blocking and 
rotatability. A central composite design with two and three 
factors is shown in Fig. 6.18. For a two-factor experiment 
design, the CCD generates 4 factorial points and 4 axial po- 
ints; for a three-factor experiment design, the CCD generates 
8 factorial points and 6 axial points. The number of center
points for some useful CCDs has also been suggested.
Sometimes, more center points than the numbers suggested are
introduced; nothing will be lost by this except the cost of
performing the additional runs. For a two-factor CCD, at least
two center points should be used, while many researchers
routinely use as many as 6 to 8 points.

If the distance from the center of the design space to a
factorial point is ±1 unit for each factor, the distance from
the center of the design space to an axial point is $\pm\alpha$
with $|\alpha| > 1$. The precise value of $\alpha$ depends on
certain properties desired for the design (for example,
whether or not the design is orthogonally blocked) and on the
number of factors involved. Similarly, the number of center
point runs needed for the design also depends on certain
properties required for the design. To maintain rotatability,
the value of $\alpha$ for CCD is chosen such that (Montgomery
2009):

$$\alpha = (n_f)^{1/4} \tag{6.13}$$
Fig. 6.18 Rotatable central composite designs for two factors
and three factors during RSM. The black dots indicate
locations of experimental runs
where $n_f$ is the number of experimental runs in the
factorial portion. For example: if the experiment has 2
factors, the full factorial portion would contain $2^2 = 4$
points, and the value of $\alpha$ for rotatability would be
$\alpha = (2^2)^{1/4} = 1.414$; if the experiment has 3 factors,
$\alpha = (2^3)^{1/4} = 1.682$; if the experiment has 4 factors,
$\alpha = (2^4)^{1/4} = 2$; and so on. As shown in Fig. 6.18, CCDs
usually have axial points outside the "cube" (unless one
intentionally specifies $\alpha < 1$ due to, say, safety concerns
in performing the experiments). Finally, since the design
points describe a circle circumscribed about the factorial
square, the optimum values must fall within this experimental
region. If not, suitable constraints must be imposed on the
function to be optimized. This is illustrated in the example
below. For further reading on RSD, the texts by Box et al.
(1978) and Montgomery (2009) are recommended.
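A minimal Python sketch of Eq. 6.13 and of the resulting design points in coded units is given below. The function names are illustrative, and a full factorial cube portion is assumed.

```python
from itertools import product

def rotatable_alpha(k):
    """Axial distance for rotatability, alpha = (n_f)^(1/4), n_f = 2^k."""
    n_f = 2 ** k
    return n_f ** 0.25

def ccd_points(k, n_center):
    """Coded design points of a rotatable CCD: cube, star, then center."""
    alpha = rotatable_alpha(k)
    cube = [tuple(float(v) for v in p) for p in product((-1, 1), repeat=k)]
    star = []
    for i in range(k):
        for sign in (-1.0, 1.0):
            pt = [0.0] * k
            pt[i] = sign * alpha      # one factor at +/- alpha, others at 0
            star.append(tuple(pt))
    center = [tuple([0.0] * k)] * n_center
    return cube + star + center

for k in (2, 3, 4):
    print(k, round(rotatable_alpha(k), 3))

for point in ccd_points(2, n_center=3):
    print(point)
```

For two factors, the four cube points and the four star points all lie at a distance of 1.414 from the center, which is the circle referred to in the text.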

Example 6.4.1: Optimizing the deposition rate for a tungsten
film on silicon wafer.

A two-factor rotatable central composite design (CCD) was
run so as to optimize the deposition rate for a tungsten film
on silicon wafer (from Buckner et al. 1993). The two factors
are the process pressure (in Torr) and the ratio of H2 to WF6
in the reaction atmosphere. The ranges for these factors are
given in Table 6.22.

Let $x_1$ be the pressure factor and $x_2$ the ratio factor.
The rotatable CCD design with three center points was
performed, with the experimental results assembled in
Table 6.23.

A second order linear regression with all 11 data points
results in a model with Adj-R2 = 0.969 and RMSE = 608.9.
The model coefficients assembled in Table 6.24 indicate that
the coefficients of ($x_1 x_2$) and ($x_1^2 x_2^2$) are not
statistically significant. Dropping these terms results in a
better model with Adj-R2 = 0.983 and RMSE = 578.8. The
corresponding values of the model coefficients are shown in
Table 6.25. In determining whether the model can be further
simplified, one notes that the highest p-value on the
independent variables is 0.0549, belonging to ($x_2^2$).
Since this p-value is greater than or equal to 0.05, that term
is not statistically significant at the 95.0% or higher
confidence level. Consequently, one could consider removing
this term from the model; this, however, was not done here
since the value is close to 0.05.

Thus, the final model is:

$$y = 8972.6 + 3454.4 x_1 + 1566.8 x_2 - 762 x_1^2 - 579.5 x_2^2 \tag{6.14}$$

Table 6.22 Assumed low and high levels for the two factors
(Example 6.4.1)

Factor            Low level    High level
Pressure          4            80
Ratio H2/WF6      2            10

Table 6.23 Results of the CCD rotatable design for two factors
with 3 center points (Example 6.4.1)

x1         x2         y
-1         -1         3663
 1         -1         9393
-1          1         5602
 1          1         12488
-1.414      0         1984
 1.414      0         12603
 0         -1.414     5007
 0          1.414     10310
 0          0         8979
 0          0         8960
 0          0         8979
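The reduced model can be checked by an ordinary least-squares fit to the coded data of Table 6.23. The sketch below is a plain-Python illustration, not the software used in the original analysis: blank entries in the axial and center rows of the table are taken to be 0, the helper names are hypothetical, and the normal equations are solved with a small Gaussian elimination routine so that no external library is needed.

```python
# Coded CCD data (x1, x2, y) as read from Table 6.23; blank entries in
# the axial and center rows are assumed to be 0.
data = [(-1, -1, 3663), (1, -1, 9393), (-1, 1, 5602), (1, 1, 12488),
        (-1.414, 0, 1984), (1.414, 0, 12603),
        (0, -1.414, 5007), (0, 1.414, 10310),
        (0, 0, 8979), (0, 0, 8960), (0, 0, 8979)]

def design_row(x1, x2):
    # Regressors of the reduced model: 1, x1, x2, x1^2, x2^2
    return [1.0, x1, x2, x1 * x1, x2 * x2]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

rows = [design_row(x1, x2) for x1, x2, _ in data]
ys = [y for _, _, y in data]
p = len(rows[0])
# Normal equations (X'X) b = X'y
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
coef = solve(xtx, xty)

for name, value in zip(["const", "x1", "x2", "x1^2", "x2^2"], coef):
    print(f"{name:6s} {value:10.2f}")
```

The fitted coefficients should agree with the estimates listed in Table 6.25 to within rounding.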


Table 6.24 Model coefficients for the second order complete
model (Example 6.4.1)

Parameter     Estimate     Standard error   t-statistic   p-value
Constant      8972.6       351.53           25.5246       0.0000
x1            3454.43      215.284          16.046        0.0001
x2            1566.79      215.284          7.27781       0.0019
x1*x2         289.0        304.434          0.949303      0.3962
x1^2          -839.837     277.993          -3.02107      0.0391
x2^2          -657.282     277.993          -2.36438      0.0773
x1^2*x2^2     310.952      430.6            0.722137      0.5102

Table 6.25 Model coefficients for the reduced model
(Example 6.4.1)

Parameter     Estimate     Standard error   t-statistic   p-value
Constant      8972.6       334.19           26.8488       0.0000
x1            3454.43      204.664          16.8785       0.0000
x2            1566.79      204.664          7.65544       0.0003
x1^2          -762.044     243.63           -3.12787      0.0204
x2^2          -579.489     243.63           -2.37856      0.0549



From Buckner et al. (1993). 



It would be wise to look at the residuals and check whether
any are unusual. Figures 6.19 and 6.20 clearly indicate that
one of the points (the first row, i.e., y = 3663) is unusual,
with a very high studentized residual (recall that
studentized residuals measure how many standard deviations
each observed value of y deviates from a model fitted using
all of the data except that observation; see Sect. 5.6.2).
Points with studentized residuals greater than 3 in absolute
value warrant a close look and, if necessary, should be
removed prior to model fitting.

Fig. 6.19 Observed versus model predicted values

Fig. 6.20 Studentized residuals versus model predicted values
(Example 6.4.1)

The optimal values of the two regressors associated
with the maximum response are determined by taking
partial derivatives and setting them to zero. This yields:
$x_1 = 2.267$ and $x_2 = 1.353$. However, this optimum lies
outside the experimental region and does not satisfy the
spherical constraint, which in this case is
$x_1^2 + x_2^2 \le 2$. This is illustrated in the contour plot
of Fig. 6.21. Resorting to a constrained optimization (see
Sect. 7.3) results in the optimal values of the regressors:
$x_1^* = 1.253$ and $x_2^* = 0.656$
representing a maximum deposition rate y* = 12,883. In
terms of the original variables, and taking the low and high
levels of Table 6.22 as the coded levels -1 and +1, these
correspond to a pressure = (80+4)/2 + 1.253*(80-4)/2 = 89.6 Torr
and a ratio H2/WF6 = (10+2)/2 + 0.656*(10-2)/2 = 8.6. Finally,
a confirmatory experiment would have to be conducted in the
neighborhood of this optimum. ■



Problems 

Pr. 6.1 Consider Example 6.2.2 where the performance of 
four machines was analyzed in terms of machining time with 
operator dexterity being a factor to be blocked. How to iden- 
tify an additive linear model was also illustrated. Figure 6.2a 
suggests that interaction effects may be important. You will 
re-analyze the data to determine whether interaction terms 
are statistically significant or not. 

Pr. 6.2^ Full-factorial design for evaluating three different 
missile systems 

A full-factorial experiment is conducted to determine which 
of 3 different missile systems is preferable. The propellant 
burning rate for 24 static firings was measured using four 
different propellant types. The experiment performed dupli- 
cate observations (replicate = 2) of burning rates (in minutes) 
at each combination of the treatments. The data, after coding, 
is given in Table 6.26. 

The following hypothesis tests are to be studied:

(a) there is no difference in the mean propellant burning
rates when different missile systems are used,

(b) there is no difference in the mean propellant burning
rates of the 4 propellant types,

(c) there is no interaction between the different missile
systems and the different propellant types.

Pr. 6.3 Random effects model for worker productivity
A full-factorial experiment was conducted to study the
effect of indoor environment condition (depending on such
factors as dry bulb temperature, relative humidity, ...) on
the productivity of workers manufacturing widgets. Four groups
of workers, called G1, G2, G3 and G4, were selected and
distinguished by such traits as age, gender, and so on. The
number of widgets produced over a day by two members of each
group



Table 6.26 Burning rates in minutes for the (3 x 4) case with
two replicates (Problem 6.2)

Missile      Propellant type
system       b1            b2            b3            b4
A1           34.0, 32.7    30.1, 32.8    29.8, 26.7    29.0, 28.9
A2           32.0, 33.2    30.2, 29.8    28.7, 28.1    27.6, 27.8
A3           28.4, 29.3    27.3, 28.9    29.7, 27.3    28.8, 29.1













Fig. 6.21 Contour plot of Eq. 6.14

From Walpole et al. (2007) by permission of Pearson Education.
Table 6.27 Showing the number of widgets produced by day using
a replicate r=2 (Problem 6.3)

Environmental    Group number
conditions       G1          G2          G3          G4
E1               227, 221    214, 259    225, 236    260, 229
E2               187, 208    181, 179    232, 198    246, 273
E3               174, 202    198, 194    178, 213    206, 219



under three different environmental conditions (E1, E2 and
E3) was recorded. These results are assembled in Table 6.27.

Using a 0.05 significance level, test the hypotheses that:

(a) different environmental conditions have no effect on
number of widgets produced,

(b) different worker groups have no effect on number of
widgets produced,

(c) there are no interaction effects between both factors.

Subsequently, identify a suitable random effects model,
study model residual behavior and draw relevant conclusions.

Pr. 6.4 The thermal efficiency of solar thermal collectors 
decreases as their average operating temperatures increase. 
One of the means of improving the thermal performance is to 
use selective surfaces for the absorber plates which have the 
special property that the absorption coefficient is high for the 
solar radiation and low for the infrared radiative heat losses. 
Two collectors, one without a selective surface and another 
with, were tested at four different operating temperatures 
under replication r=4. The experimental results of thermal 
efficiency in % are tabulated in Table 6.28.

(a) Perform an analysis of variance to test for significant 
main and interaction effects, 

(b) Identify a suitable random effects model, 

(c) Identify a linear regression model and compare your re- 
sults with those from part (b), 

(d) Study model residual behavior and draw relevant con- 
clusions. 

Pr. 6.5 The close similarity between a factorial design mo- 
del and a multiple linear regression model was illustrated in 
Example 6.3.1. You will repeat this exercise with data from 
Example 6.3.2. 



Table 6.28 Thermal efficiencies (%) of the two solar thermal
collectors (Problem 6.4)

                             Mean operating temperature (°C)
                             80                70                60                50
Without selective surface    28, 29, 31, 32    34, 33, 35, 34    38, 39, 41, 38    40, 42, 41, 41
With selective surface       33, 36, 33, 34    38, 38, 36, 35    41, 40, 43, 42    43, 45, 44, 45


(a) Identify a multiple linear regression model and verify 
that the parameters of all regressors are identical to the 
factorial design model, 

(b) Verify that model coefficients do not change when mul- 
tiple linear regression is redone with the reduced model 
using variables coded as - 1 and + 1, 

(c) Perform a forward step-wise linear regression and ver- 
ify that you get back the same reduced model with the 
same coefficients. 

Pr. 6.6 $2^3$ factorial analysis for strength of concrete mix
A civil construction company wishes to maximize the
strength of its concrete mix with three factors or variables:
A (water content), B (coarse aggregate), and C (silica). A
$2^3$ full factorial set of experimental runs, consistent with
the nomenclature of Table 6.13, was performed. These results
are assembled below:

[ 58.27, 55.06, 58.73, 52.55, 54.88, 58.07, 56.60, 59.57] 

(a) You are asked to analyze this data so as to identify sta- 
tistically meaningful terms, 

(b) If the minimum and maximum range of the three fac- 
tors are: A(0.3576, 0.4392), B(0.4071, 0.4353) and 
C(0.0153, 0.0247), develop a prediction model for this 
problem, 

(c) Identify a multiple linear regression model and verify 
that the parameters of all regressors are identical to the 
factorial design model, 

(d) Verify that model coefficients do not change when mul- 
tiple linear regression is redone with the reduced mo- 
del. 

Pr. 6.7 As part of the first step of a response surface (RS)
approach, the following linear model was identified from
preliminary experimentation using two coded variables:

$$y = 55 - 2.5 x_1 + 1.2 x_2 \quad \text{with} \quad -1 \le x_i \le +1$$

Determine the path of steepest ascent, and draw this path on
a contour plot.

Pr. 6.8¹ Predictive model inferred from $2^3$ factorial design
on a large laboratory chiller

Table 6.29 assembles steady state data of a $2^3$ factorial
series of laboratory tests conducted on a 90 Ton centrifugal
chiller. There are three factors ($T_{cho}$: temperature of the
chilled water leaving the evaporator, $T_{cdi}$: temperature of
the cooling water entering the condenser, and $Q_{ch}$: chiller
cooling load) with two levels each, thereby resulting in 8
data points without any replication. Note that there are small
differences in the high and low levels of each of the factors
because of operational control variability

¹ Adapted from a more extensive table from data collected by
Comstock and Braun (1999). We are thankful to James Braun for
providing this data.
Table 6.29 Laboratory tests from a centrifugal chiller
(Problem 6.8)

        Data for model development                Data for cross-validation
Test#   Tcho (°C)  Tcdi (°C)  Qch (kW)   COP      Tcho (°C)  Tcdi (°C)  Qch (kW)   COP
1       10.940     29.816     315.011    3.765    7.940      29.628     286.284    3.593
2       10.403     29.559     103.140    2.425    7.528      24.403     348.387    4.274
3       10.038     21.537     289.625    4.748    6.699      24.288     188.940    3.678
4       9.967      18.086     122.884    3.503    7.306      24.202     93.798     2.517
5       4.930      27.056     292.052    3.763
6       4.541      26.783     109.822    2.526
7       4.793      21.523     354.936    4.411
8       4.426      18.666     114.394    3.151

during testing. The chiller Coefficient of Performance (COP) 
is the response variable. 

(a) Perform an ANOVA analysis, and check the importance 
of the main and interaction terms using the 8 data points 
indicated in the table, 

(b) Identify the parsimonious predictive model from the ab- 
ove ANOVA analysis, 

(c) Identify a least square regression model with coded va- 
riables and compare the model coefficients with those 
from the model identified in part (b), 

(d) Generate model residuals and study their behavior (in- 
fluential outliers, constant variance and near-normal 
distribution), 

(e) Reframe both models in terms of the original variables 
and compare the internal prediction errors, 

(f) Using the four data sets indicated in the table as holdout 
points meant for cross-validation, compute the NMSE, 
RMSE and CV values of both models. Draw relevant 
conclusions. 
Optimization Methods



This chapter provides an introductory overview of traditional 
optimization techniques as applied to engineering applicati- 
ons. These apply to situations where the impact of uncertain- 
ties is relatively minor, and can be viewed as a subset of de- 
cision-making problems which are treated in Chap. 12. After 
defining the various terms used in the optimization literature, 
calculus based methods covering both analytical as well as 
numerical techniques are reviewed. Subsequently, different 
solutions to problems which can be grouped as linear, qua- 
dratic or non-linear programming are described, while highl- 
ighting the differences between them and the methods used 
to solve such problems. A complete illustrative example of 
how to set up an optimization problem for a combined heat 
and power system is presented. Methods that allow global 
solutions as against local ones are described. Finally, the 
important topic of dynamic optimization is covered which 
applies to optimizing a trajectory, i.e., to discrete situations 
when a series of decisions have to be made to define or ope- 
rate a system composed of distinct stages, such that a deci- 
sion is made at each stage with the decisions at later stages 
not affecting the performance of earlier ones. There is a vast 
amount of published material on the subject of optimization, 
and this chapter is simply meant as a brief overview. 



7.1 Background 

One of the most important tools for both design and operati- 
on of engineering systems is optimization which corresponds 
to the case of decision-making under low uncertainty. This 
branch of applied mathematics, also studied under "opera- 
tions research" (OR), is the use of specific methods where 
one tries to minimize or maximize a global characteristic 
(say, the cost or the benefit) whose variation is modeled by 
an objective function. The set-up of the optimization problem
involves not only the formulation of the objective function
but, as importantly, the explicit and complete consideration
of a set of constraints. Optimization problems arise in almost
all branches of industry or society, e.g., in product and en- 



gineering process design, production scheduling, logistics, 
traffic control and even strategic planning. 

Optimization in an engineering context involves certain 
basic elements consisting of some or all of the following: 
(i) the framing of a situation or problem (for which a so- 
lution or a course of action is sought) in terms of a mathe- 
matical model often called the objective function; this could 
be a simple expression, or framed as a decision tree model 
in case of multiple outcomes (deterministic or probabilistic) 
or sequential decision making stages, (ii) defining the ran- 
ge constraints to the problem in terms of input parameters 
which may be dictated by physical considerations, (iii) pla- 
cing bounds on the solution space of the output variables in 
terms of some practical or physical constraints, (iv) defining 
or introducing uncertainties in the input parameters and in 
the types of parameters appearing in the model, (v) mathe- 
matical techniques which can solve such models efficiently 
(short execution times) and accurately (unbiased solutions); 
and (vi) sensitivity analysis to gauge the robustness of the 
optimal solution to various uncertainties. 

Framing of the mathematical model involves two types 
of uncertainties: epistemic or lack of complete knowledge of 
the process or system which can be reduced as more data is 
acquired, and aleatory uncertainty which has to do with the
stochasticity of the process, and cannot be reduced by collec- 
ting more data. These notions are discussed at some length 
in Chap. 12 while dealing with decision analysis. This chap- 
ter deals with traditional optimization techniques as applied 
to engineering applications which are characterized by low 
aleatory and epistemic uncertainty. Further, recall the concept 
of abstraction presented in Sect. 1.2.3 in the context of for- 
mulating models. It pertains to the process of deciding on the 
level of detail appropriate for the problem at hand without, 
on one hand, over-simplification which may result in loss of 
important system behavior predictability, while on the other 
hand, avoiding the formulation of an overly-detailed model 
which may result in undue data and computational resources 
as well as time spent in understanding the model assumptions
and results generated. The same concept of abstraction
also applies to the science of optimization. One has to set a
level of abstraction commensurate with the complexity of the
problem at hand and the accuracy of the solution sought.

Fig. 7.1 Pumping system whose operational cost is to be
optimized (Example 7.1.1)

Let us consider a problem framed as finding the optimum 
of a continuous function. There could, of course, be the added 
complexity of considering several discrete options; but each 
option has one or more continuous variables requiring proper 
control so as to achieve a global optimum. A simple example 
(Pr. 1.7 from Chap. 1) will be used to illustrate this case. 

Example 7.1.1: Simple example involving function minimization

Two pumps with parallel networks (Fig. 7.1) deliver a
volumetric flow rate F = 0.01 m^3/s of water from a reservoir
to the destination. The pressure drops in Pascals (Pa) of each
network are given by $\Delta p_1 = (2.1) \cdot 10^{10} \cdot F_1^2$ and
$\Delta p_2 = (3.6) \cdot 10^{10} \cdot F_2^2$, where $F_1$ and $F_2$ are the
flow rates through each branch in m^3/s. Assume that both the
pumps and their motor assemblies have equal efficiencies
$\eta_1 = \eta_2 = 0.9$. Let $P_1$ and $P_2$ be the electric power
in Watts (W) consumed by the two pump-motor assemblies.
Since the power consumed is equal to the volume flow rate times
the pressure drop, the objective function to be minimized is
the sum of the power consumed by both pumps:

$$J = \frac{\Delta p_1 F_1}{\eta_1} + \frac{\Delta p_2 F_2}{\eta_2} \quad \text{or} \quad J = \frac{(2.1) \cdot 10^{10} \cdot F_1^3}{0.9} + \frac{(3.6) \cdot 10^{10} \cdot F_2^3}{0.9} \tag{7.1}$$

The sum of both flows is equal to 0.01 m^3/s, and so $F_2$ can
be eliminated in Eq. 7.1. Thus, the sought-after solution is
the value of $F_1$ which minimizes the objective function J:

$$\text{Min}\{J\} = \text{Min}\left\{ \frac{(2.1) \cdot 10^{10} \cdot F_1^3}{0.9} + \frac{(3.6) \cdot 10^{10} \cdot (0.01 - F_1)^3}{0.9} \right\} \tag{7.2}$$



Fig. 7.2 One type of post-optimality analysis involves plotting the ob- 
jective function for Total Power to evaluate the shape of the curve near 
the optimum. In this case, there is a broad optimum indicating that the 
system can be operated near-optimally over this range without much 
corresponding power penalty 

From basic calculus, $dJ/dF_1 = 0$ would provide the
optimum solution, from which $F_1$ = 0.00567 m^3/s and
$F_2$ = 0.00433 m^3/s, and the total power of both pumps is
7501 W. The extent to which non-optimal performance is likely
to lead to excess power can be gauged (referred to as
post-optimality analysis) by simply plotting the function J vs
$F_1$ (Fig. 7.2). In this case, the optimum is rather broad;
the system can be operated such that $F_1$ is in the range of
0.005-0.006 m^3/s without much power penalty. On the other
hand, sensitivity analysis would involve a study of how the
optimum value is affected by certain parameters. For example,
Fig. 7.3 shows that varying the efficiency of pump 1 in the
range of 0.85-0.95 has negligible impact on the optimal
result. However, this may not be the case for some other
variable. A systematic study of how various parameters impact
the optimal value falls under sensitivity analysis, and there
exist formal methods of investigating this aspect of the
problem. Finally, note that this is a very simple optimization
problem with a simple imposed constraint which was not even
considered during the optimization. ■

Fig. 7.3 Sensitivity analysis with respect to efficiency of
pump 1 on the overall optimum
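The numbers of this example can be verified with a few lines of Python. The sketch below uses the closed-form solution of $dJ/dF_1 = 0$ for Eq. 7.2 and then evaluates the power penalty at a few off-optimum flow splits as a crude post-optimality check. The variable and function names are illustrative.

```python
ETA = 0.9                  # efficiency of each pump-motor assembly
F_TOTAL = 0.01             # total volumetric flow rate, m^3/s
K1, K2 = 2.1e10, 3.6e10    # pressure-drop coefficients, Pa/(m^3/s)^2

def total_power(f1):
    """Objective function J of Eq. 7.2 in W, with F2 = F_TOTAL - F1."""
    f2 = F_TOTAL - f1
    return (K1 * f1 ** 3 + K2 * f2 ** 3) / ETA

# dJ/dF1 = 0 gives K1*F1^2 = K2*(F_TOTAL - F1)^2, hence
ratio = (K2 / K1) ** 0.5
f1_opt = F_TOTAL * ratio / (1.0 + ratio)
f2_opt = F_TOTAL - f1_opt
j_opt = total_power(f1_opt)
print(f"F1 = {f1_opt:.5f} m^3/s, F2 = {f2_opt:.5f} m^3/s, J = {j_opt:.0f} W")

# Post-optimality check: excess power at off-optimum values of F1
for f1 in (0.0050, 0.0055, 0.0060):
    penalty = 100.0 * (total_power(f1) / j_opt - 1.0)
    print(f"F1 = {f1:.4f} m^3/s  J = {total_power(f1):7.1f} W  (+{penalty:.1f}%)")
```

Repeating the calculation with a different efficiency for pump 1 (which then requires separate efficiency terms in the objective function) would give the kind of sensitivity study shown in Fig. 7.3.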



7.2 Terminology and Classification 

7.2.1 Basic Terminology and Notation 

A mathematical formulation of an optimization problem in- 
volving control of an engineering system consists of the fol- 
lowing terms or categories: 

(i) Decision variables or process variables, say
($x_1, x_2, \ldots, x_n$), whose respective values are to be
determined. These can be either discrete or continuous
variables;
(ii) Control variables, which are the physical quantities 
which can be varied by hardware according to the nu- 
merical values of the decision variables sought. Deter- 
mining the "best" numerical values of these variables is 
the basic intent of optimization; 
(iii) Objective function, which is an analytical formulation 
of an appropriate measure of performance of the system 
(or characteristic of the design problem) in terms of de- 
cision variables; 
(iv) Constraints or restrictions on the values of the decisi- 
on variables. These can be of two types: non-negative 
constraints, for example, flow rates cannot be negative; 



and functional constraints (also called structural
constraints), which can be equality, inequality or range
constraints that specify a range of variation over which
the decision variables can be varied. These can be based 
on direct considerations (such as not exceeding capacity 
of energy equipment, limitations of temperature & pres- 
sure control values,...) or on indirect ones (when mass 
and energy balances have to be satisfied). 
(v) Model parameters are constants appearing in constraints 
and objective equations. 
Establishing the objective function is often simple. The 
real challenge is usually in specifying the complete set of 
constraints. A feasible solution is one which satisfies all the 
stated constraints, while an infeasible solution is one where 
at least one constraint is violated. The optimal solution is a 
feasible solution that has the most favorable value (either ma- 
ximum or minimum) of the objective function, and it is this 
solution which is being sought after. The optimal solutions 
can be a single point or even several points. Also, some pro- 
blems may have no optimal solutions at all. Figure 7.4 shows 
a function to be maximized subject to several constraints (six 
in this case). Note that there is no feasible solution and one of 
the constraints has to be relaxed or the problem reframed. In
some optimization problems, one can obtain several equally
optimal solutions. This is illustrated in Fig. 7.5 where several
combinations of the two variables, which define the line segment
shown, are possible optima.



Fig. 7.4 An example of a 
constrained optimization problem 
with no feasible solution 




Maximize \(Z = 2x_1 + 5x_2\)
subject to \(x_1 \le 4\), \(x_2 \le 4\), \(x_1 + x_2 \le 6\),
\(2x_1 + 3x_2 \ge 30\), and \(x_1 \ge 0\), \(x_2 \ge 0\)



Fig. 7.5 An example of a 
constrained optimization problem 
with more than one feasible 
solution 

Sometimes, an optimal solution may not necessarily be
the one selected for implementation. A "satisficing" solution
(a combination of the words "satisfactory" and "optimizing")
may be the solution which is selected for actual implementation,
and reflects the difference between theory (which yields an
optimal solution) and the reality faced (due to actual
implementation issues, heuristic constraints which cannot be
expressed mathematically, the need to treat unpredictable
occurrences, risk attitudes of the owner/operator, ...). Some
practitioners also refer to such solutions as "near-optimal",
though this has a somewhat negative connotation.



7.2.2 Traditional Optimization Methods 

Optimization methods can be categorized in a number of ways: 

(i) Linear vs. non-linear. Linear optimization problems in- 
volve linear models and a linear objective function and 
linear constraints. The theory is well developed, and 
solutions can be found relatively quickly and robustly. 
There is an enormous amount of published literature on 
this type of problem, and it has found numerous practical 
applications involving up to several thousands of independent
variables. There are several well-known techniques to
solve them (the Simplex method in Operations Research 
used to solve a large set of linear equations being the best 
known). However, many problems from engineering to 
economics require the use of non-linear models or cons- 
traints, in which case, non-linear programming techni- 
ques have to be used. In some cases, non-linear models 
(for example, equipment models such as chillers, fans, 
pumps and cooling towers) can be expressed as quadra- 
tic models, and algorithms more efficient than non-linear 
programming ones have been developed; this falls under 
quadratic programming methods. Calculus-based met- 
hods are often used for finding the solution of the model 
equations. 

(ii) Continuous vs. discontinuous. When the objective functi- 
ons are discontinuous, calculus based methods can break 
down. In such cases, one could use non-gradient based 
methods or even heuristic based computational methods 
such as simulated annealing or genetic algorithms. The 
latter are very powerful in that they can overcome pro- 
blems associated with local minima and discontinuous 
functions, but they need long computing times and a cer- 
tain amount of knowledge of such techniques on the part 
of the analyst. Another form of discontinuity arises when 
one or more of the variables are discrete as against conti- 
nuous. Such cases fall under the classification known as 
integer or discrete programming. 

(iii) Static vs. dynamic. If the optimization is done with time
not being a factor, then the procedure is called static.
However, if optimization has to be done over a time period
where decisions can be made at several sub-intervals
of that period, then a dynamic optimization method is
warranted. Two such examples are when one needs to
optimize the route taken by a salesman visiting different
cities as part of his road trip, or when the operation of a
thermal ice storage supplying cooling to a building has
to be optimized during several hours of the day during
which high electric demand charges prevail. Whenever
possible, analysts make simplifying assumptions to
make the optimization problem static.
(iv) Deterministic vs. stochastic. This depends on whether one
neglects or considers the uncertainty associated with
various parameters of the objective function and the
constraints. The need to treat these uncertainties together, and
in a probabilistic manner, rather than one at a time (as is
done in a sensitivity analysis) has led to the development
of several numerical techniques, the Monte Carlo technique
being the most widely used (Sect. 12.2.7).



7.2.3 Types of Objective Functions 

Single criterion optimization is one where a single over-ri- 
ding objective function can be formulated. It is used in the 
majority of optimization problems. For example, an indus- 
trialist is considering starting a factory to assemble photo- 
voltaic (PV) cells into PV modules. Whether to invest or not, 
and if yes, at what capacity level are issues which can both 
be framed as a single criterion optimization problem. Ho- 
wever, if maximizing the number of jobs created is another 
(altruistic) objective, then the problem must be treated as a 
multi-criteria decision problem. Such cases are discussed in 
Sect. 12.2.6. 



7.2.4 Sensitivity Analysis or Post Optimality 
Analysis 

Model parameters are often not known with certainty, and 
could be based on models identified from partial or incom- 
plete observations, or they could even be guess-estimates. 
The optimum is only correct insofar as the model is accu- 
rate, and the model parameters and constraints reflective of 
the actual situation. Hence, the optimal solution determined 
needs to be reevaluated in terms of how the various types 
of uncertainties affect it. This is done by sensitivity analysis,
which determines the range of values:

(i) of the parameters over which the optimal solution will
remain unchanged (allowable range to stay near-optimal).
This would flag critical parameters which may require
closer investigation, refinement and monitoring;
(ii) over which the optimal solution will remain feasible with
adjusted values for the basic variables (allowable range
to stay feasible, i.e. the constraints are satisfied). This
would help identify influential constraints.
Further, the above evaluations can be performed by
adopting:

(i) individual parameter sensitivity, where one parameter at 
a time in the original model is varied (or perturbed) to 
check its effect on the optimal solution; 
(ii) total sensitivity (also called parametric programming) 
involves the study of how the optimal solution changes 
as many parameters change simultaneously over some 
range. Thus, it provides insight into "correlated" para- 
meters and trade off in parameter values. Such evalua- 
tions are conveniently done using Monte Carlo methods 
(Sect. 12.2.7). 



7.3 Calculus-Based Analytical and Search 
Solutions 

Calculus-based solution methods can be applied to both linear
and non-linear problems, and are the ones to which undergraduate
students are most likely to be exposed. They
can be used for problems where the objective function and 
the constraints are differentiable. These methods are also re- 
ferred to as classical or traditional optimization methods as 
distinct from machine-learning methods. A brief review of 
calculus-based analytical and search methods is presented 
below. 




Fig. 7.6 Illustrations of convex, concave and combination functions
(one panel each). A convex function is one where every point on
the line joining any two points on the graph does not lie below
the graph at any point. A combination function is one which
exhibits both convex and concave behavior during different
portions, with the switch-over being the saddle point



7.3.1 Simple Unconstrained Problems

The basic calculus of the univariate unconstrained optimization
problem can be extended to the multivariate case of
dimension n by introducing the gradient vector \(\nabla\) and by
recalling that the gradient of a scalar y is defined as:

\[
\nabla y = \frac{\partial y}{\partial x_1}\mathbf{i}_1 + \frac{\partial y}{\partial x_2}\mathbf{i}_2 + \ldots + \frac{\partial y}{\partial x_n}\mathbf{i}_n \tag{7.3}
\]

where \(\mathbf{i}_1, \mathbf{i}_2, \ldots, \mathbf{i}_n\) are unit vectors, and y is the
objective function and is a function of n variables:
\(y = y(x_1, x_2, \ldots, x_n)\).

With this terminology, the condition for optimality of a
continuous function y is simply:

\[
\nabla y = 0 \tag{7.4}
\]

However, the optimality may be associated with stationary
points which could be minimum, maximum, saddle or
ridge points. Since objective functions are conventionally
expressed as a minimization problem, one seeks the minimum
of the objective function. Recall that for the univariate case,
assuring that the optimal value found is a minimum (and not
a maximum or a saddle point) involves computing the numerical
value of the second derivative at this optimal point, and
checking that its value is positive. Graphically, a minimum
for a continuous function is found (or exists) when the function
is convex, while a saddle point is found for a combination
function (see Fig. 7.6). In the multivariate optimization
case, one checks whether the Hessian matrix (i.e., the second
derivative matrix, which is symmetrical) is positive definite
or not. It is tedious to check this condition by hand for any
matrix whose dimensionality is greater than 2, and so computer
programs are invariably used for such problems. A simple
hand calculation method (which works well for low dimension
problems) for ascertaining whether the optimal point is
a minimum (or maximum) is to simply perturb the optimal
solution vector obtained, compute the objective function and
determine whether this value is higher (or lower) than the
optimal value found.

Example 7.3.1: Determine the minimum value of the following
function:

\[
y = \frac{1}{4x_1} + 8x_1^2 x_2 + \frac{1}{x_2^2}
\]

First, the two first order derivatives are found:

\[
\frac{\partial y}{\partial x_1} = -\frac{1}{4x_1^2} + 16x_1x_2
\quad \text{and} \quad
\frac{\partial y}{\partial x_2} = 8x_1^2 - \frac{2}{x_2^3}
\]

Setting the above two expressions to zero and solving results
in: \(x_1^* = 0.2051\) and \(x_2^* = 1.8114\) at which condition
\(y^* = 2.133\). It is left to the reader to verify these results, and
check whether this is indeed the minimum. ■
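
The verification left to the reader in Example 7.3.1 can be sketched numerically. The code below assumes that the objective reads y = 1/(4 x1) + 8 x1^2 x2 + 1/x2^2, which is the form consistent with the optimum quoted in the example; the function names are illustrative only. It solves the two first-order conditions in closed form and then applies the positive-definiteness test on the 2 x 2 Hessian.

```python
def y(x1, x2):
    """Objective of Example 7.3.1 (assumed form, see the text above)."""
    return 1.0 / (4.0 * x1) + 8.0 * x1**2 * x2 + 1.0 / x2**2


def gradient(x1, x2):
    """Analytical first derivatives (dy/dx1, dy/dx2)."""
    return (-1.0 / (4.0 * x1**2) + 16.0 * x1 * x2,
            8.0 * x1**2 - 2.0 / x2**3)


def hessian(x1, x2):
    """Entries (h11, h12, h22) of the symmetric second-derivative matrix."""
    return (1.0 / (2.0 * x1**3) + 16.0 * x2,
            16.0 * x1,
            6.0 / x2**4)


# Setting both derivatives to zero gives x2 = 1/(64 x1^3) and
# x2^3 = 1/(4 x1^2), which combine into x1^7 = 2^-16.
x1_opt = 2.0 ** (-16.0 / 7.0)
x2_opt = 1.0 / (64.0 * x1_opt**3)
y_opt = y(x1_opt, x2_opt)

# A symmetric 2x2 matrix is positive definite when h11 > 0 and det > 0.
h11, h12, h22 = hessian(x1_opt, x2_opt)
is_minimum = h11 > 0 and (h11 * h22 - h12**2) > 0

print(round(x1_opt, 4), round(x2_opt, 4), round(y_opt, 3), is_minimum)
```

Both Hessian conditions hold at the stationary point, so under the stated assumption the point is a local minimum.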



7.3.2 Problems with Equality Constraints 

Most practical problems have constraints in terms of the
independent variables, and often these assume the form of
equality constraints only. There are several semi-analytical
techniques which allow the constrained optimization
problem to be reformulated into an unconstrained one, and
the manner in which this is done is what differentiates these
methods. In such a case, one does not need generalized
optimization solver approaches requiring software programs
which need greater skill to use properly and may take longer
to solve.

The simplest approach is the direct substitution method 
where for a problem involving "n" variables and "m" equa- 
lity constraints, one tries to eliminate the m constraints by 
direct substitution, and solve the objective function using the 
unconstrained solution method described above. This appro- 
ach was used in Example 7.1.1. 
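
A minimal sketch of direct substitution is given below. It assumes an objective of the form f = a x1^2 + b x2^2 and one linear equality constraint p x1 + q x2 = r, and is called with the coefficients of the example that follows; the function and argument names are illustrative only.

```python
def direct_substitution(a, b, p, q, r):
    """Minimise f = a*x1**2 + b*x2**2 subject to p*x1 + q*x2 = r.

    The constraint is solved for x1 = (r - q*x2)/p and substituted into f,
    leaving the one-variable quadratic A*x2**2 + B*x2 + C, whose minimum
    lies at x2 = -B/(2*A) (valid when A > 0).
    """
    A = a * q**2 / p**2 + b
    B = -2.0 * a * q * r / p**2
    C = a * r**2 / p**2
    x2 = -B / (2.0 * A)
    x1 = (r - q * x2) / p
    f = a * x1**2 + b * x2**2
    return (A, B, C), (x1, x2), f


coeffs, (x1_opt, x2_opt), f_opt = direct_substitution(a=4, b=5, p=2, q=3, r=6)
print(coeffs)                                   # coefficients of the reduced quadratic
print(round(x1_opt, 3), round(x2_opt, 3), round(f_opt, 3))
```

The elimination is only this easy because the constraint is linear; for a general non-linear constraint an explicit expression for one variable may not exist.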

Example 7.3.2: Direct substitution method
Consider the simple optimization problem stated as:

\[
\text{Minimize } f(\mathbf{x}) = 4x_1^2 + 5x_2^2 \tag{7.5a}
\]
\[
\text{subject to: } 2x_1 + 3x_2 = 6 \tag{7.5b}
\]

Either \(x_1\) or \(x_2\) can be eliminated without difficulty. Say, the
constraint equation is used to solve for \(x_1\), which is then
substituted into the objective function. This yields the
unconstrained objective function:

\[
f(x_2) = 14x_2^2 - 36x_2 + 36
\]

The optimal value is \(x_2^* = 1.286\) from which, by substitution,
\(x_1^* = 1.071\). The resulting value of the objective function is:
\(f(\mathbf{x})^* = 12.857\).

This simple problem allows a geometric visualization to
better illustrate the approach. As shown in Fig. 7.7, the
objective function is a paraboloid shown on the z axis with \(x_1\)
and \(x_2\) being the other two axes. The constraint is represented
by a plane surface which intersects the paraboloid as shown.
The resulting intersection is a parabola whose optimum is
the solution of the objective function being sought after.
Notice how this constrained optimum is different from the
unconstrained optimum which occurs at (0, 0) (Fig. 7.7). ■

The above approach requires that one variable be first
explicitly expressed as a function of the remaining variables,
and then eliminated from all equations; this procedure is
continued till there are no more constraints. Unfortunately,
this is not an approach which is likely to be of general
applicability in most problems.

Fig. 7.7 Graphical representation of how direct substitution can reduce
a function with two variables \(x_1\) and \(x_2\) into one with one variable.
The contours are those of f(x) projected onto the \(x_1\)-\(x_2\) plane. The
unconstrained optimum is at (0, 0) at the center of the contours. (From
Edgar et al. 2001 by permission of McGraw-Hill)



7.3.3 Lagrange Multiplier Method

A more versatile and widely used approach which allows
the constrained problem to be reformulated into an
unconstrained one is the Lagrange multiplier approach. Consider an
optimization problem involving an objective function y, a
set of n decision variables x and a set of m equality
constraints h(x):

\[
\text{Minimize } y = y(\mathbf{x}) \quad \text{objective function} \tag{7.6a}
\]
\[
\text{subject to } \mathbf{h}(\mathbf{x}) = 0 \quad \text{equality constraints} \tag{7.6b}
\]

The Lagrange multiplier method simply absorbs the equality
constraints into the objective function, and states that the
optimum occurs when the following modified objective
function is minimized:

\[
J^* = \min\{J(\mathbf{x}, \boldsymbol{\lambda})\}, \qquad
J = y(\mathbf{x}) - \lambda_1 h_1(\mathbf{x}) - \lambda_2 h_2(\mathbf{x}) - \ldots \tag{7.7}
\]

where the quantities \(\lambda_1, \lambda_2, \ldots\) are called the Lagrange
multipliers. The optimization problem, thus, involves minimizing
J with respect to both x and the Lagrange multipliers.

Eliminating the constraints comes at the price
of increasing the dimensionality of the problem from n to
(n + m), or stated differently, one is now seeking the values
of (n + m) variables, as against n only, which will optimize the
function J.

A simple example with one equality constraint serves to
illustrate this approach. The objective function \(y = 2x_1 + 3x_2\)
is to be optimized subject to the constraint \(x_1 x_2^2 = 48\).
Figure 7.8 depicts this problem visually with the two variables
being the two axes and the objective function being
represented by a series of parallel lines for different assumed
values of y. Since the constraint is a curved line, the optimal
solution is obviously the point where the tangent vector of
the curve (shown as a dotted line) is parallel to these lines
(shown as point A).
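
The simple example can be worked through numerically as a check. The sketch below assumes that the constraint reads x1 x2^2 = 48, the form consistent with the optimum (3, 4) quoted in the caption of Fig. 7.8, and uses the sign convention J = y - lambda h; the function name is illustrative only.

```python
def lagrange_residuals(x1, x2, lam):
    """Partial derivatives of J = (2*x1 + 3*x2) - lam*(x1*x2**2 - 48).

    All three must vanish at the constrained optimum.
    """
    dJ_dx1 = 2.0 - lam * x2**2
    dJ_dx2 = 3.0 - lam * 2.0 * x1 * x2
    dJ_dlam = -(x1 * x2**2 - 48.0)
    return dJ_dx1, dJ_dx2, dJ_dlam


# Eliminating lam between the first two equations gives x1 = 3*x2/4;
# the constraint then reduces to (3/4)*x2**3 = 48.
x2_opt = (48.0 * 4.0 / 3.0) ** (1.0 / 3.0)
x1_opt = 3.0 * x2_opt / 4.0
lam = 2.0 / x2_opt**2
y_opt = 2.0 * x1_opt + 3.0 * x2_opt

print(round(x1_opt, 6), round(x2_opt, 6), round(lam, 6), round(y_opt, 6))
print(lagrange_residuals(x1_opt, x2_opt, lam))
```

The three residuals are zero to within rounding, confirming that the point satisfies all (n + m) = 3 Lagrange equations.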



From Edgar et al. (2001) by permission of McGraw-Hill. 



Fig. 7.8 Optimization of the linear function \(y = 2x_1 + 3x_2\) subject to the
constraint shown. The problem is easily solved using the Lagrange
multiplier method to yield optimal values of (\(x_1^* = 3\), \(x_2^* = 4\)).
Graphically, the optimum point A occurs where the constraint function
and the lines of constant y (which, in this case, are linear) have a
common normal indicated by the arrow at A



Example 7.3.3: Optimizing a solar water heater system
using the Lagrange method

A solar water heater consisting of a solar collector and
storage tank is to be optimized for lowest first cost consistent
with the following specified system performance. During the
day the storage temperature is to be raised from 30°C (equal
to the ambient temperature \(T_a\)) to a temperature \(T_{max}\), while
during the night heat is to be withdrawn from storage such
that the storage temperature drops back to 30°C for the next
day's operation. The system should be able to store 20 MJ of
thermal heat over a typical day of the year during which \(H_T\),
the incident radiation over the collector operating time
(assumed to be 10 h), is 12 MJ/m². The collector performance
characteristics are \(F_R\eta_0 = 0.8\) and \(F_R U_L = 4.0\) W/m².°C. The costs
of the solar subsystem components are: a fixed cost of $600,
a collector area proportional cost of $200/m² of collector
area, and a storage volume proportional cost of $200/m³ of
storage volume.

Assume that the average inlet temperature to the collector
is equal to the arithmetic mean of \(T_{max}\) and \(T_a\).

Let \(A_c\) (m²) and \(V_s\) (m³) be the collector area and storage
volume respectively. The objective function is:

\[
J = 600 + 200\,A_c + 200\,V_s \tag{7.8}
\]



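The constraint of the daily amount of solar energy collected
is expressed as:

\[
Q_c = A_c\,[H_T F_R \eta_0 - F_R U_L (T_{Ci} - T_a)\,\Delta t] \tag{7.9a}
\]

where \(\Delta t\) is the number of seconds during which the collector
operates and \(T_{Ci}\) is the average collector inlet temperature.
Inserting numerical values:

\[
(20)(10^6) = A_c\left\{(12)(10^6)(0.8) - (4)(3600)(10)\left[\frac{T_{max} + 30}{2} - 30\right]\right\}
\]

or

\[
20 \times 10^6 = (11.76 - 0.072\,T_{max}) \times 10^6\,A_c \tag{7.9b}
\]

A heat balance on the storage over the day yields

\[
20 \times 10^6 = V_s\,(1000)(4190)(T_{max} - 30) \tag{7.10}
\]

from which \(T_{max} = \dfrac{20}{4.19\,V_s} + 30\).

Substituting this back into the constraint Eq. 7.9b results in

\[
A_c\left(9.6 - \frac{0.344}{V_s}\right) = 20
\]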

This allows the combined Lagrangian objective function to
be deduced as:

\[
J = 600 + 200\,A_c + 200\,V_s - \lambda\left[A_c\left(9.6 - \frac{0.344}{V_s}\right) - 20\right] \tag{7.11}
\]

The resulting set of Lagrangian equations are:

\[
\frac{\partial J}{\partial A_c} = 200 - \lambda\left(9.6 - \frac{0.344}{V_s}\right) = 0
\]
\[
\frac{\partial J}{\partial V_s} = 200 - \lambda A_c \frac{0.344}{V_s^2} = 0
\]
\[
\frac{\partial J}{\partial \lambda} = -\left[A_c\left(9.6 - \frac{0.344}{V_s}\right) - 20\right] = 0
\]

Solving this set yields the sought-after optimal values:
\(A_c^* = 2.36\) m² and \(V_s^* = 0.308\) m³. The value of the Lagrangian
multiplier is \(\lambda = 23.58\), and the corresponding initial cost is
\(J^*\) = $1134. The Lagrangian multiplier can be interpreted as
the sensitivity coefficient, which in this example corresponds
to the marginal cost of solar thermal energy. In other words,
increasing the thermal requirements by 1 MJ would lead to an
increase of \(\lambda\) = $23.58 in the initial cost of the optimal solar
system. ■
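
The three Lagrangian equations of this example can be solved in closed form, which also allows the sensitivity interpretation of the multiplier to be checked. The sketch below is an illustration under two assumptions: the constraint is taken as A_c (9.6 - 0.344/V_s) = q, and the coefficient 0.344 is held fixed when the right-hand side q is perturbed from 20 to 21 MJ. The function and argument names are hypothetical.

```python
import math


def solar_design(q=20.0, k=0.344, c_area=200.0, c_vol=200.0, c_fixed=600.0):
    """Solve dJ/dAc = dJ/dVs = dJ/dlam = 0 for
    J = c_fixed + c_area*Ac + c_vol*Vs - lam*(Ac*(9.6 - k/Vs) - q).

    Dividing the first two equations and using the constraint gives
    (9.6*Vs - k)**2 = q*k*c_area/c_vol, from which Vs follows directly.
    """
    vs = (k + math.sqrt(q * k * c_area / c_vol)) / 9.6
    g = 9.6 - k / vs                # energy collected per unit area, MJ/m2
    ac = q / g                      # from the constraint
    lam = c_area / g                # from dJ/dAc = 0
    cost = c_fixed + c_area * ac + c_vol * vs
    return ac, vs, lam, cost


ac, vs, lam, cost = solar_design()
_, _, _, cost_plus = solar_design(q=21.0)   # one extra MJ on the right-hand side

print(f"Ac = {ac:.3f} m2, Vs = {vs:.4f} m3, lambda = {lam:.2f}, J = {cost:.1f}")
print(f"cost increase for +1 MJ = {cost_plus - cost:.2f}")
```

The results agree with the values quoted in the example to within rounding, and the cost increase for the extra megajoule is close to the multiplier, as the sensitivity interpretation suggests.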



7.3.4 Penalty Function Method 



See Pr 5.6 for a description of the solar collector model. 



Another widely used method for constrained optimization is
the use of penalty factors where the problem is converted
to an unconstrained one. Consider the problem defined by
Eq. 7.6 with the possibility that the constraints can be
inequality constraints as well. Then, a new unconstrained
function is framed as:

\[
J^* = \min\{J(\mathbf{x})\} = \min\left\{y(\mathbf{x}) + \sum_{i=1}^{k} P_i\,(h_i)^2\right\} \tag{7.12}
\]

where \(P_i\) is called the penalty factor for condition i with k
being the number of constraints. The choice of this penalty
factor provides the relative weighting of the constraint
compared to the function. For high \(P_i\) values, the search will
satisfy the constraints but move more slowly in optimizing
the function. If \(P_i\) is too small, the search may terminate
without satisfying the constraints adequately. The penalty
factor can assume any function, but the nature of the
problem may often influence the selection. For example, when
a forward model is being calibrated with experimental data,
one has some prior knowledge of the numerical values of
the model parameters. Instead of simply performing a
calibration based on minimizing the least square errors, one
could frame the problem as an unconstrained penalty factor
problem where the function to be minimized consists of a
term representing the root sum of square errors, and of the
penalty factor term which may be the square deviations of
the model parameters from their respective estimates. The
following example illustrates this approach, while it is
further described in Sect. 10.5.2 when dealing with non-linear
parameter estimation.

Example 7.3.4: Minimize the following problem using the
penalty function approach:

\[
y = 5x_1^2 + 4x_2^2 \quad \text{s.t.} \quad 3x_1 + 2x_2 = 6 \tag{7.13}
\]

Let us assume a simple form of the penalty factor and frame
the problem as:

\[
J^* = \min(J) = \min[y + P(h)^2]
= \min[5x_1^2 + 4x_2^2 + P(3x_1 + 2x_2 - 6)^2] \tag{7.14}
\]

Then:

\[
\frac{\partial J}{\partial x_1} = 10x_1 + 6P(3x_1 + 2x_2 - 6) = 0 \tag{7.15a}
\]

and

\[
\frac{\partial J}{\partial x_2} = 8x_2 + 4P(3x_1 + 2x_2 - 6) = 0 \tag{7.15b}
\]

Solving these results in \(x_1 = \dfrac{6x_2}{5}\) which, when substituted
back into Eq. 7.15a, yields:

\[
x_2 = \frac{36}{\dfrac{12}{P} + \dfrac{108}{5} + 12}
\]

The optimal values of the variables are found as the limiting
values when P becomes very large. In this case, \(x_2^* = 1.071\)
and, subsequently, from \(x_1 = 6x_2/5\), \(x_1^* = 1.286\); these are the
optimal solutions sought. ■
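
The limiting behaviour of the penalty function solution can be seen by solving Eqs. 7.15a and 7.15b for increasing values of P. The sketch below is a minimal illustration (the function name is hypothetical); since both equations are linear in x1 and x2 for a given P, Cramer's rule is sufficient.

```python
def penalty_solution(P):
    """Solve Eqs. 7.15a-b for a given penalty factor P.

    Collecting terms, the two equations read
        (10 + 18P) x1 +      12P x2 = 36P
             12P  x1 + (8 + 8P) x2 = 24P
    """
    a11, a12, b1 = 10.0 + 18.0 * P, 12.0 * P, 36.0 * P
    a21, a22, b2 = 12.0 * P, 8.0 + 8.0 * P, 24.0 * P
    det = a11 * a22 - a12 * a21
    x1 = (b1 * a22 - a12 * b2) / det
    x2 = (a11 * b2 - a21 * b1) / det
    return x1, x2


for P in (1.0, 10.0, 100.0, 1.0e4, 1.0e6):
    x1, x2 = penalty_solution(P)
    h = 3.0 * x1 + 2.0 * x2 - 6.0          # remaining constraint violation
    print(f"P={P:>9.0f}  x1={x1:.4f}  x2={x2:.4f}  h={h:+.5f}")
```

For small P the constraint is visibly violated (h is negative), and the violation shrinks towards zero as P grows, which is the trade-off described in the text.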



7.4 Numerical Search Methods 

Most practical optimization problems will have to be solved 
using numerical search methods. This implies that the search 
towards an optimum is done systematically and progressi- 
vely using an iterative approach to gradually zoom onto the 
optimum. Because the search is performed at discrete points, 
the precise optimum will not be known. The best that can 
be achieved is to specify an interval of uncertainty which 
is the range of x values in which the optimum is known to 
exist. The search methods differ depending on whether the 
problem is univariate/multivariate, unconstrained/constrained,
or has continuous/discontinuous functions. Numerous
search methods have been proposed ranging from the ex- 
haustive search method to genetic algorithms. 

Fig. 7.9 Conceptual illustration of finding a minimum point of a
bi-variate function using a lattice search method. From an initial
point 1, the best subsequent move involves determining the function
values around that point at discrete grid points (points 2 through 9)
and moving to the point with the lowest function value

A general method is the lattice search (Fig. 7.9) where one
starts at one point in the search space (shown as point 1),
calculates values of the function at a number of points around
the initial point (points 2-9), and moves to the point which
has the lowest value (shown as point 5). This process is
repeated till the overall minimum is found. Sometimes, one may
use a coarse grid search initially, find the optimum within an
interval of uncertainty, then repeat the search using a finer
grid. Note that this is not a calculus-based method, and is not
very efficient computationally, especially for higher dimension
problems. However, it is robust and simple to implement.
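
The lattice search just described can be sketched in a few lines. The code below is one possible implementation for two variables, not a reference algorithm: it evaluates the eight surrounding grid points, moves to the lowest one, and halves the grid spacing whenever no neighbour is better (the coarse-to-fine refinement mentioned above). The test function `bowl` and all parameter names are hypothetical.

```python
def lattice_search(f, start, step=1.0, min_step=1e-6, max_moves=10_000):
    """Minimise a two-variable function f(x1, x2) by a lattice search."""
    x1, x2 = start
    best = f(x1, x2)
    moves = 0
    while step > min_step and moves < max_moves:
        # The eight grid points surrounding the current point.
        neighbours = [(x1 + i * step, x2 + j * step)
                      for i in (-1, 0, 1) for j in (-1, 0, 1)
                      if (i, j) != (0, 0)]
        value, point = min((f(a, b), (a, b)) for a, b in neighbours)
        if value < best:
            best, (x1, x2) = value, point   # move to the lowest neighbour
            moves += 1
        else:
            step /= 2.0                     # no better neighbour: refine grid
    return (x1, x2), best


def bowl(x1, x2):
    """Hypothetical convex test function with its minimum at (2, -1)."""
    return (x1 - 2.0) ** 2 + 2.0 * (x2 + 1.0) ** 2


point, value = lattice_search(bowl, start=(0.0, 0.0))
print(point, value)
```

Only function values are used, so the same routine works for functions that are not differentiable, at the cost of eight evaluations per move.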
Calculus-based methods are generally efficient for pro- 
blems with continuous functions; these are referred to as 
hill-climbing methods. Strictly speaking, hill-climbing le- 
ads one to a maximum; the algorithm is easily modified for 
valley-descending as needed for minimization. Two general 
approaches are described below while other methods can be 
found in Stoecker (1989) or Beveridge and Schechter (1970). 

(a) Univariate search (Fig. 7.10): this is a calculus based 
method where the function is optimized with respect to 
one variable at a time. One starts by using some preli- 
minary values for all variables other than the one being 
optimized, and finds the optimum value for the selected 
variable (shown as Xj). One then selects a second variab- 
le to optimize while retaining this optimal value of the 
first variable, and finds the optimal value of the second 
variable, and so on for all remaining variables. Though 
the problem reduces to a simple univariate search, it may 
be computationally inefficient for higher dimension pro- 
blems, and worse, may not converge to the global opti- 
mum if the search space is not symmetrical. The entire 
process often requires more than one iteration, as shown 
in Fig. 7.10. 

(b) Steepest-descent search (Fig. 7.11): this is a widely used
approach because of its efficiency. The computational
algorithm involves three steps: one starts with a guess
value (represented by point 1) which is selected somewhat
arbitrarily but, if possible, close to the optimal value;
one then evaluates the gradient of the function at the
current point by computing the partial derivatives either
analytically or numerically; and, finally, one moves
along this gradient (hence, the terminology "steepest")
by deciding, somewhat arbitrarily, on the step size. The
relationship between the step sizes \(\Delta x_i\) and the partial
derivatives \((\partial y/\partial x_i)\) is:

\[
\frac{\Delta x_1}{\partial y/\partial x_1} = \frac{\Delta x_2}{\partial y/\partial x_2} = \ldots = \frac{\Delta x_i}{\partial y/\partial x_i} \tag{7.16}
\]

Steps 2-3 are performed iteratively till the minimum (or
maximum) point is reached. A note of caution is that too
large a step size can result in numerical instability, while
too small a step size increases computation time.

Extensions of this general approach to optimization
problems with non-linear constraints and with inequality
constraints are also well developed, and will not be discussed
here. The reader can refer, for example, to Stoecker (1989)
or to Beveridge and Schechter (1970) for the widely used
non-linear hemstitching method which has its roots in the
Lagrangian approach.

Fig. 7.10 Conceptual illustration of finding a minimum point of a
bi-variate function using the univariate search method. From an initial
point 1, the gradient of the function is used to find the optimal
value of one variable keeping the other fixed, and so on till the
optimal point 5 is found

Fig. 7.11 Conceptual illustration of finding a minimum point of a
bi-variate function using the steepest descent search method. From an
initial point 1, the gradient of the function is determined and the next
search point determined by moving in that direction, and so on till the
optimal point 4 is found

Example 7.4.1: Illustration of the univariate search method
Consider the following function with two variables which is
to be minimized using the univariate search process starting
with an initial value of \(x_2 = 3\).

\[
y = x_1 + \frac{16}{x_1 x_2} + \frac{x_2}{2}
\]

First, the partial derivatives are derived:

\[
\frac{\partial y}{\partial x_1} = 1 - \frac{16}{x_1^2 x_2}
\quad \text{and} \quad
\frac{\partial y}{\partial x_2} = -\frac{16}{x_1 x_2^2} + \frac{1}{2}
\]

Next, the initial value of \(x_2 = 3\) is used to find the next
iterative value of \(x_1\) from the \((\partial y/\partial x_1)\) function as follows:

\[
1 - \frac{16}{(3)\,x_1^2} = 0 \quad \text{from where} \quad x_1 = \frac{4\sqrt{3}}{3} = 2.309
\]

The other partial derivative is finally used with this value of
\(x_1\) to yield:

\[
-\frac{16}{2.309\,x_2^2} + \frac{1}{2} = 0 \quad \text{from where} \quad x_2 = 3.722
\]

The new value of \(x_2\) is now used for the next cycle, and the
iterative process repeated until consecutive improvements
turn out to be sufficiently small to suggest convergence.
It is left to the reader to verify that the optimal values are
(\(x_1^* = 2\), \(x_2^* = 4\)). ■
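
The repeated cycles of Example 7.4.1 are easily automated. The sketch below assumes that the function reads y = x1 + 16/(x1 x2) + x2/2, the form consistent with the iterates and the optimum quoted in the example; the function name and the convergence tolerance are illustrative choices.

```python
import math


def univariate_search(x2_start=3.0, tol=1e-8, max_cycles=100):
    """Univariate search for y = x1 + 16/(x1*x2) + x2/2 (assumed form).

    Each half-step sets one partial derivative to zero while the other
    variable is held fixed:
        dy/dx1 = 1 - 16/(x1**2 * x2) = 0   ->  x1 = sqrt(16 / x2)
        dy/dx2 = 1/2 - 16/(x1 * x2**2) = 0 ->  x2 = sqrt(32 / x1)
    """
    x2 = x2_start
    history = []
    for _ in range(max_cycles):
        x1 = math.sqrt(16.0 / x2)
        x2_new = math.sqrt(32.0 / x1)
        history.append((x1, x2_new))
        converged = abs(x2_new - x2) < tol
        x2 = x2_new
        if converged:
            break
    return x1, x2, history


x1_opt, x2_opt, history = univariate_search()
print(history[0])                    # first cycle of the example
print(x1_opt, x2_opt, len(history))  # converged values and number of cycles
```

The first entry of `history` reproduces the hand calculation of the example, and the later cycles show how quickly the successive corrections shrink.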

Example 7.4.2: Illustration of the steepest descent method
Consider the following function with three variables to be
minimized:

\[
y = \frac{72x_1}{x_2} + \frac{360}{x_1 x_3} + x_1 x_2 + 2x_3
\]

Assume a starting point of (\(x_1 = 5\), \(x_2 = 6\), \(x_3 = 8\)). At this point,
the value of the function is y = 115.

These numerical values are inserted in the expressions for
the partial derivatives:

\[
\frac{\partial y}{\partial x_1} = \frac{72}{x_2} - \frac{360}{x_1^2 x_3} + x_2 = \frac{72}{6} - \frac{360}{(5)^2(8)} + 6 = 16.2
\]
\[
\frac{\partial y}{\partial x_2} = -\frac{72x_1}{x_2^2} + x_1 = -\frac{72(5)}{(6)^2} + 5 = -5
\]
\[
\frac{\partial y}{\partial x_3} = -\frac{360}{x_1 x_3^2} + 2 = -\frac{360}{(5)(8)^2} + 2 = 0.875
\]

In order to compute the next point, a step size has to be
assumed. Let us arbitrarily assume \(\Delta x_1 = -1\) (whether to take a
positive or a negative value for the spatial step is not
obvious; in this case a minimum is sought, and the reader can
verify that taking a negative value results in a decrease in the
function value y). Applying Eq. 7.16 results in

\[
\frac{-1}{16.2} = \frac{\Delta x_2}{-5} = \frac{\Delta x_3}{0.875}
\]

from where \(\Delta x_2 = 0.309\), \(\Delta x_3 = -0.054\). Thus, the new point is
(\(x_1 = 4\), \(x_2 = 6.309\), \(x_3 = 7.946\)). The reader can verify that the
new point has resulted in a decrease in the functional value
from 115 to 98.1. Repeated use of the search method will
gradually result in the optimal value being found. ■
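
The single steepest-descent move of Example 7.4.2 can be reproduced with the short sketch below. It assumes the function reads y = 72 x1/x2 + 360/(x1 x3) + x1 x2 + 2 x3, which gives y = 115 at the starting point as stated in the example; the function names are illustrative only.

```python
def y(x):
    """Objective of Example 7.4.2 (assumed form, see the text above)."""
    x1, x2, x3 = x
    return 72.0 * x1 / x2 + 360.0 / (x1 * x3) + x1 * x2 + 2.0 * x3


def grad(x):
    """Analytical partial derivatives of y."""
    x1, x2, x3 = x
    return (72.0 / x2 - 360.0 / (x1**2 * x3) + x2,
            -72.0 * x1 / x2**2 + x1,
            -360.0 / (x1 * x3**2) + 2.0)


def step_from_dx1(x, dx1):
    """One move along the gradient direction, scaled so that the change in
    x1 equals dx1; Eq. 7.16 then fixes the other step sizes because all
    ratios dx_i / (dy/dx_i) must be equal."""
    g = grad(x)
    ratio = dx1 / g[0]
    return tuple(xi + ratio * gi for xi, gi in zip(x, g))


start = (5.0, 6.0, 8.0)
new_point = step_from_dx1(start, dx1=-1.0)

print(grad(start))
print(new_point)
print(y(start), y(new_point))
```

Calling `step_from_dx1` repeatedly, with a smaller step whenever the function value stops decreasing, gives the iterative search described in the text.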



these, and would require a formal mathematically-based 
numerical approach to find the optimal solution. Numerical 
efficiency (or power) of a method of solution involves both 
robustness of the solution and fast execution times. Optimi- 
zation problems which can be framed into a linear problem 
(even at the expense of a little loss in accuracy) have great 
numerical efficiency. Only if the objective function and the 
constraints are both linear functions is the problem designa- 
ted as a linear optimization problem; otherwise it is deemed 
a non-linear function. The objective function can involve 
one or more functions to be either minimized or maximized 
(either objective can be treated identically since it is easy to 
convert one into the other). 

There is a great deal of literature on methods to solve li- 
near programs, which are referred to as linear programming 
methods. The Simplex algorithm is the most popular techni- 
que for solving linear problems and involves matrix inversion 
along with directional iterations; it also provides the neces- 
sary information for performing a sensitivity analysis at the 
same time. Hence, formulating problems as linear problems 
(even when they are not strictly so) has a great advantage in 
the solving phase. 

The standard form of linear programming problems is: 



minimize /(x) 



T 
C X 



subject to : g(x) : Ax = b 



(7.17a) 
(7.17b) 



where x is the column vector of variables of dimension n, b 
that of the constraint limits of dimension m, c that of the cost 
coefficients of dimension n, and A is the (m x n) matrix of 
constraint coefficients. Notice that no inequality constraints 
appear in the above formulation. This is because inequali- 
ty constraints can be re-expressed as equality constraints by 
introducing additional variables, called slack variables. The 
order of the optimization problem will increase, but the ef- 
ficiency in the subsequent numerical solution approach out- 
weighs this drawback. The following simple example serves 
to illustrate this approach. 

Example 7.5.1: Express the following linear two-dimensio- 
nal problem into standard matrix notation: 

Maximize /(x) : 3186 + 620xi + 420x2 (7.18a) 

gi(x): 0.5x1+ 0.7x2 < 6.5 

subject to g2(x) : 4.5xi - X2 < 35 (7.18b) 

g3(x) : 2.1x1 +5.2x2 < 60 



7.5 Linear Programming 

The above sections were meant to provide a basic background 
to optimization problems using rather simplistic examples. 
Most practical problems would be much more complex than 



with range constraints on the variables Xj and x^ being that 
these should not be negative. 

This is a problem with two variables (x^ and x^). However, 
three slack variables need to be introduced in order to refra- 
me the three inequality constraints as equality constraints. 



7.6 Quadratic Programming 
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This makes the problem into one with five unknown variab- It can be written in matrix form as: 
les. The three inequaUty constraints are rewritten as: 



gi(x) 



0.5xi + 0.7x2 + X3 =6.5 
4.5xi — X2 + X4 = 35 
2.1xi + 5.2x2 + X5 = 60 



(7.19) 



Hence, the terms appearing in the standard form (eq. 7.17) 
are: 

c = [ -620 -420 0]^, 

X = [ Xi X2 X3 X4 X5 ]^ , 

0.5 0.7 1 
4.5 -10 10 
2.1 5.2 1 

b = [ 6.5 35 60 f 

Note that the objective function is recast as a minimization problem simply by reversing the signs of the coefficients. Also, the constant does not appear in the optimization since it can simply be added to the optimal value of the function at the end. Step-by-step solutions of such optimization problems are given in several textbooks such as Edgar et al. (2001), Hillier and Lieberman (2001) and Stoecker (1989). A commercial optimization software program was used to determine the optimal value of the above objective function:

$f^*(\mathbf{x}) = 9803.8$

Note that in this case, since the inequalities are of the "less than or equal to" type, the numerical values of the slack variables ($x_3$, $x_4$, $x_5$) cannot be negative. The optimal values for the primary variables are $x_1^* = 8.493$ and $x_2^* = 3.219$, while those for the slack variables are $x_3^* = 0$, $x_4^* = 0$ and $x_5^* = 25.424$ (implying that constraints 1 and 2 in Eq. 7.18b have turned out to be equality constraints). ■
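The optimum quoted above can be cross-checked without a commercial solver. The sketch below is not the method used in the text: it relies on the fact that the optimum of a linear program lies at a corner of the feasible region, so for two variables it simply intersects every pair of constraint lines, discards infeasible points and keeps the best one. The function and variable names are illustrative.

```python
from itertools import combinations

# Each row is (a1, a2, b), meaning a1*x1 + a2*x2 <= b.
# The last two rows encode the range constraints x1 >= 0 and x2 >= 0.
CONSTRAINTS = [
    (0.5, 0.7, 6.5),
    (4.5, -1.0, 35.0),
    (2.1, 5.2, 60.0),
    (-1.0, 0.0, 0.0),
    (0.0, -1.0, 0.0),
]


def objective(x1, x2):
    """Objective function of Eq. 7.18a (to be maximized)."""
    return 3186 + 620 * x1 + 420 * x2


def solve_by_vertices(constraints, tol=1e-7):
    """Enumerate pairwise intersections and return the best feasible corner."""
    best = None
    for (a1, a2, b1), (c1, c2, b2) in combinations(constraints, 2):
        det = a1 * c2 - a2 * c1
        if abs(det) < 1e-12:
            continue  # parallel lines have no unique intersection
        x1 = (b1 * c2 - a2 * b2) / det  # Cramer's rule
        x2 = (a1 * b2 - b1 * c1) / det
        feasible = all(p * x1 + q * x2 <= r + tol for p, q, r in constraints)
        if feasible and (best is None or objective(x1, x2) > best[0]):
            best = (objective(x1, x2), x1, x2)
    return best


f_opt, x1_opt, x2_opt = solve_by_vertices(CONSTRAINTS)
# Slack variables x3, x4, x5 of Eq. 7.19
slacks = [b - (a1 * x1_opt + a2 * x2_opt) for a1, a2, b in CONSTRAINTS[:3]]
print(round(f_opt, 1), round(x1_opt, 3), round(x2_opt, 3))
print([round(s, 3) for s in slacks])
```

Corner enumeration grows combinatorially with the number of constraints, so it is only sensible for small illustrative problems like this one; the Simplex method visits corners selectively.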



7.6 Quadratic Programming 

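A function of dimension n (i.e., there are n variables) is said to be quadratic when:

$f(\mathbf{x}) = a_{11}x_1^2 + a_{12}x_1x_2 + \dots + a_{ij}x_ix_j + \dots = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}x_ix_j$ (7.20)

where the coefficients are constants. Consider the function which is quadratic in two variables:

$f(x_1, x_2) = 4x_1^2 + 12x_1x_2 - 6x_2x_1 - 8x_2^2$ (7.21)

It can be written in matrix form as:

$f(x_1, x_2) = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 4 & 12 \\ -6 & -8 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$

Because $12x_1x_2 - 6x_2x_1 = 6x_1x_2$, the cross term can be split equally between the two off-diagonal entries, and the function can also be written as:

$f(x_1, x_2) = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 4 & 3 \\ 3 & -8 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$

Thus, the coefficient matrix of any quadratic function can be written in symmetric form.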
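As a small numerical check of this statement, the sketch below evaluates the quadratic function of Eq. 7.21 with a non-symmetric coefficient matrix and with its symmetric part, in which each off-diagonal entry is half of the net cross-term coefficient (here 6/2 = 3). The helper names `quad_form` and `symmetrize` are illustrative only.

```python
def quad_form(matrix, x):
    """Evaluate sum_i sum_j a_ij * x_i * x_j."""
    n = len(x)
    return sum(matrix[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def symmetrize(matrix):
    """Return the symmetric part (A + A^T) / 2 of a square matrix."""
    n = len(matrix)
    return [[0.5 * (matrix[i][j] + matrix[j][i]) for j in range(n)]
            for i in range(n)]


# Coefficients of Eq. 7.21 with a12 = 12 and a21 = -6
A = [[4, 12], [-6, -8]]
A_sym = symmetrize(A)

x = [1.5, -2.0]  # arbitrary test point
print(A_sym)
print(quad_form(A, x), quad_form(A_sym, x))
```

Both evaluations return the same value at any point, because only the sum of the two off-diagonal coefficients enters the function.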

Quadratic programming problems differ from the linear ones in only one aspect: the objective function is quadratic in its terms (while the equality and inequality constraints must be linear). Even though such problems can be treated as non-linear problems, formulating the problem as a quadratic one allows for greater numerical efficiency in finding the solutions. Numerical algorithms to solve such problems are similar to the linear programming ones; a modified Simplex method has been developed which is quite popular.

The standard notation is:

Minimize $f(\mathbf{x}) = \mathbf{c}\mathbf{x} + \frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x}$ (7.22)

Subject to $g(\mathbf{x}): \mathbf{A}\mathbf{x} = \mathbf{b}$ (7.23)

Note that the coefficient matrix Q is symmetric, as explained above.

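Example 7.6.1: Express the following problem in standard quadratic programming formulation:

Min $J = 4x_1^2 + 4x_2^2 + 8x_1x_2 - 60x_1 - 45x_2$ (7.24)

subject to $2x_1 + 3x_2 = 30$

In this case:

$\mathbf{c} = [\,-60 \;\; -45\,]$, $\mathbf{x} = [\,x_1 \;\; x_2\,]^T$, $\mathbf{Q} = \begin{bmatrix} 8 & 8 \\ 8 & 8 \end{bmatrix}$, $\mathbf{A} = [\,2 \;\; 3\,]$, $\mathbf{b} = [\,30\,]$

It is easy to verify that:

$\mathbf{x}^T\mathbf{Q}\mathbf{x} = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 8 & 8 \\ 8 & 8 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} (8x_1 + 8x_2) & (8x_1 + 8x_2) \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 8x_1^2 + 16x_1x_2 + 8x_2^2$

so that $\frac{1}{2}\mathbf{x}^T\mathbf{Q}\mathbf{x}$ reproduces the quadratic terms of Eq. 7.24.

The reader can verify that the optimal solution corresponds to $x_1^* = 3.75$ and $x_2^* = 7.5$, which results in a value of $J^* = -56.25$ for the objective function. ■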
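One way to carry out this verification is sketched below. It is not the modified Simplex method mentioned in the text. For a quadratic objective with a single linear equality constraint, setting the gradient of the Lagrangian to zero gives the linear system Qx + Aᵀλ = −cᵀ, Ax = b, which a small Gaussian elimination can solve. The routine `solve_linear` and the multiplier name `lam` are illustrative.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


# Terms of Example 7.6.1 in the notation of Eqs. 7.22 and 7.23
Q = [[8.0, 8.0], [8.0, 8.0]]
c = [-60.0, -45.0]
A = [2.0, 3.0]
b = 30.0

# Stationarity (Q x + A^T lam = -c) stacked with feasibility (A x = b)
kkt = [
    [Q[0][0], Q[0][1], A[0]],
    [Q[1][0], Q[1][1], A[1]],
    [A[0], A[1], 0.0],
]
rhs = [-c[0], -c[1], b]
x1, x2, lam = solve_linear(kkt, rhs)

J = (c[0] * x1 + c[1] * x2
     + 0.5 * (Q[0][0] * x1 ** 2 + 2 * Q[0][1] * x1 * x2 + Q[1][1] * x2 ** 2))
print(x1, x2, J)
```

The stationary point is a minimum here because the objective is convex along the constraint line: substituting $x_1 = 15 - 1.5x_2$ into Eq. 7.24 reduces it to $x_2^2 - 15x_2$.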



7.7 Non-linear Programming

Non-linear problems are those where either the objective function or any of the constraints are non-linear. Such problems represent the general case and are of great interest. A widely used notation to describe the complete nonlinear optimization problem is to frame the problem as:

Minimize $y = y(\mathbf{x})$ objective function (7.25)

subject to $\mathbf{h}(\mathbf{x}) = 0$ equality constraints (7.26)

$\mathbf{g}(\mathbf{x}) \le 0$ inequality constraints (7.27)

$l_j \le x_j \le u_j$ range or boundary constraints (7.28)

where x is a vector of p variables.

The constraints h(x) and g(x) are vectors of independent equations of dimension $m_1$ and $m_2$ respectively. If these constraints are linear, then the problem is said to have linear constraints; otherwise it is said to have non-linear constraints. The constraints $l_j$ and $u_j$ are lower and upper bounds of the decision variables, of dimension $m_3$. Thus, the total number of constraints is $m = m_1 + m_2 + m_3$.

Several software codes have been developed to solve constrained non-linear problems, and these involve rather sophisticated numerical methods. The two most widely used techniques are: (i) the sequential quadratic programming (SQP) method, which uses a Taylor-series quadratic expansion of the functions around the current search point, with the solution then found successively; and (ii) the generalized reduced gradient (GRG) method, which is a sophisticated search method whose basic approach was previously described in Sect. 7.4. The interested reader can refer to Venkataraman (2002) for a detailed description of these as well as other numerical optimization methods.

A note of caution is needed at this stage. Quite often, the optimal point is such that some of the constraints turn out to be redundant (but one has no way of knowing that beforehand), and even worse, the problem may be found to have either no solution or an infinite number of solutions. In such cases, for a unique solution to be found, the optimization problem may have to be reformulated in such a manner that, while being faithful to the physical problem being solved, some of the constraints are relaxed or reframed. This is easier said than done, and even the experienced analyst may have to evaluate alternative formulations before deciding on the most appropriate one.

7.8 Illustrative Example: Combined Heat and Power System

Combined Heat and Power (CHP) components and systems are described in several books and technical papers (for example, see Petchers 2003). Such systems meant for commercial/institutional buildings (BCHP) involve multiple prime movers, chillers and boilers and require more careful and sophisticated equipment scheduling and control methods as compared to those in industrial CHP. This is due to the large variability in building thermal and electric loads as well as the equipment scheduling issue. Equipment scheduling involves determining which of the numerous equipment combinations to operate, i.e., it is concerned with starting or stopping prime movers, boilers and chillers. The second and lower level type of control is called supervisory control, which involves determining the optimal values of the control parameters (such as loading of primemovers, boilers and chillers) under a specific combination of equipment schedule. The complete optimization problem, for a given hour, would qualify as a mixed-integer programming problem because different discrete pieces of equipment may be on or off. The problem can be tackled by using algorithms appropriate for mixed integer programming where certain variables can only assume integer values (for example, for the combinatorial problem, a certain piece of equipment can be on or off, which can be designated as 0 or 1 respectively). Usually, such algorithms are not very efficient, and a typical approach in engineering problems when faced with mixed integer problems is to treat integer variables as continuous, and solve the continuous problem. The optimal values of these variables are then simply found by rounding to the nearest integer. In the particular case of the BCHP optimization problem, another approach which works well for medium sized situations (involving, say, up to about 100 combinations) is to proceed as follows. For a given hour specified by the climatic variables and the building loads, all the feasible combinations of equipment are first generated. Subsequently, a lower level optimization is done for each of these feasible equipment combinations, from which the best combination of equipment to meet the current load can be selected.

Currently, little optimization of the interactions among systems is done in buildings. Heuristic control normally used by plant operators often results in off-optimal operation due to the numerous control options available to them as well as due to dynamic, time-varying rate structures and relative changes in gas and electricity prices. Though reliable estimates are lacking in the technical literature, the consensus is that 5-15% of cost savings can be realized if these multiple-equipment BCHP plants were operated more rationally and optimally.
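The two-level approach described above (generate every on/off combination, run a lower level optimization for each feasible one, keep the cheapest) can be sketched as follows. Everything numerical here is hypothetical: three chillers with made-up capacities, minimum part load ratios and constant unit costs, and a 600 kW cooling load. With constant unit costs the lower level problem reduces to loading the cheapest running units first, which would not hold for the polynomial part-load models used later in this section.

```python
from itertools import product

# Hypothetical equipment data (capacity in kW, cost in $ per kWh of cooling)
UNITS = [
    {"name": "chiller_A", "capacity": 500.0, "x_min": 0.3, "cost": 0.030},
    {"name": "chiller_B", "capacity": 300.0, "x_min": 0.2, "cost": 0.045},
    {"name": "chiller_C", "capacity": 200.0, "x_min": 0.2, "cost": 0.025},
]


def dispatch(on_units, load):
    """Lower level problem for one on/off combination.

    Returns (hourly cost, loads per unit), or None if the combination
    cannot meet the load within its part load limits.
    """
    lowest = sum(u["x_min"] * u["capacity"] for u in on_units)
    highest = sum(u["capacity"] for u in on_units)
    if not on_units or load < lowest or load > highest:
        return None
    # Start every running unit at its minimum load, then fill cheapest first.
    loads = {u["name"]: u["x_min"] * u["capacity"] for u in on_units}
    remaining = load - lowest
    for u in sorted(on_units, key=lambda unit: unit["cost"]):
        extra = min(remaining, u["capacity"] - loads[u["name"]])
        loads[u["name"]] += extra
        remaining -= extra
    cost = sum(u["cost"] * loads[u["name"]] for u in on_units)
    return cost, loads


def best_combination(units, load):
    """Upper level: enumerate all on/off flags and keep the cheapest."""
    best = None
    for flags in product([0, 1], repeat=len(units)):
        on_units = [u for u, flag in zip(units, flags) if flag]
        result = dispatch(on_units, load)
        if result is not None and (best is None or result[0] < best[0]):
            best = (result[0], flags, result[1])
    return best


cost, flags, loads = best_combination(UNITS, 600.0)
print(flags, round(cost, 2), loads)
```

With these numbers, running all three chillers is feasible but not cheapest, because the most expensive unit would have to be held at its minimum load. The number of combinations doubles with each added piece of equipment, consistent with the remark that the approach suits medium sized situations.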



Fig. 7.12 Generic schematic of a combined heat and power (CHP) system meant to supply cooling, heating and part of the electricity needs of a building. Sub-system interactions and nomenclature used are also shown. The terms x1, x2, x3 and x4 are control variables which represent the loading fractions of the primemover, boiler, vapor compression chiller and the absorption chiller respectively. (From Reddy and Maor 2009)

Figure 7.12 is a generic schematic of how the important subsystems of a BCHP system (namely, primemovers, vapor compression chillers, absorption chillers and boilers) are often coupled to serve the building loads (Reddy and Maor 2009). The static optimization case, without utility sell-back, involves optimizing the operating cost of the BCHP system for each time step, i.e. each hour, while it meets the building loads: the non-cooling electric load ($E_{Bldg}$), the thermal cooling load ($Q_C$) and the building thermal heating load ($Q_H$). Assume that the cost components only include steady state hourly energy costs for electricity and gas. So, the quantity to be minimized, J, is the total cost of energy consumption, summed over all components that are operating, plus the equipment operation and maintenance (O&M) costs.

The objective function to be optimized for a particular time step (or hour) and for a specific BCHP system combination is:

$J^* = \min\{J\} = \min\{J_1 + J_2 + J_3\}$ (7.29)

where

- the cost associated with gas use is $J_1 = (G_{Gen} + G_{BP}) \cdot C_g$
- the cost associated with electric use is $J_2 = E_{Purchase} \cdot C_e$ (7.30)
- the O&M cost is $J_3 = M_{O\&M}$

subject to the inequality constraints that the building loads must be met (called functional constraints*):

- building thermal cooling load: $Q_{AC} + Q_{VC} \ge Q_C$
- building thermal heating load: $Q_{BP} + H_{Gen} - H_{AC} \ge Q_H$ (7.31)
- building non-cooling electric load: $(E_{Purchase} + E_{Gen} - E_{VC} - E_P) \ge E_{Bldg}$

and subject to boundary or range constraints that

- the primemover part load ratio: $x_{Gen,min} \le x_{Gen} \le 1.0$
- the vapor compression chiller part load ratio: $x_{VC,min} \le x_{VC} \le 1.0$
- the absorption chiller part load ratio: $x_{AC,min} \le x_{AC} \le 1.0$ (7.32)
- the boiler plant part load ratio: $x_{BP,min} \le x_{BP} \le 1.0$

* The inequality constraints allow for energy dumping. In practice, this is almost never done, in which case one could recast the three constraints of Eq. 7.31 as equality constraints.

where:

$C_e$: unit energy cost of electricity use
$C_g$: unit energy cost of natural gas
$E_{Gen}$: actual electric power output of primemover
$E_{Purchase}$: amount of purchased electricity
$E_P$: parasitic electric use of the BCHP plant (pumps, fans, etc.)
$E_{VC}$: electricity consumed by the vapor compression chiller
$G_{BP}$: amount of natural gas heat consumed by the boiler plant
$G_{Gen}$: amount of natural gas heat consumed by the primemover
$H_{AC}$: heat supplied to the absorption chiller
$H_{Gen}$: total recovered waste heat from the primemover
$M_{O\&M}$: operation and maintenance costs of the BCHP equipment which are operated
$Q_{AC}$: amount of cooling supplied by the absorption chiller
$Q_{BP}$: amount of heating supplied by the boiler plant
$Q_{VC}$: amount of cooling supplied by the vapor compression chiller
$x$: part load ratios of the four major pieces of equipment, i.e., actual load divided by their rated capacity. The subscripts are: AC = absorption chiller, BP = boiler plant, Gen = generator, VC = vapor compression chiller. Note that the lower bound values of the range constraints in Eq. 7.32 are specific to the type of equipment and are limits below which such equipment should not be operated.
One also needs to model the energy consumption for each of the components as a function of the component's characteristics and of the controlled variables. The models to be used for optimization can be of three types:

(a) detailed simulation models originally developed for providing insights into design issues and which are most appropriate for research purposes;

(b) semi-empirical component models that combine deterministic modeling involving thermodynamic and heat transfer considerations with some empirical curve-fit models so as to provide some degree of modeling detail of sub-components of the major equipment, such as the effect of back-pressure on turbine performance, individual heat exchanger performance, power for gas compression,...;

(c) semi-empirical inverse models, which can be either grey-box or black-box depending on whether the underlying physics is used during model development. The traditional black-box approach using rated equipment performance along with polynomial models to capture part load performance is illustrated below in view of its conceptual simplicity.

A simple manner of modeling part-load performance of various equipment is given below. Part-load electrical efficiency of reciprocating engines and microturbines can be modeled as:

$y_{Gen} = a_0 + a_1 x_{Gen} + a_2 x_{Gen}^2$ (7.33)

where $y_{Gen}$ is the relative electrical efficiency = (actual efficiency/rated efficiency), and

$x_{Gen}$ is the relative power output = (actual power/rated power) $= E_{Gen}/E^*_{Gen}$ (7.34)

with the superscript (*) denoting rated conditions.

The numerical values of the part-load model coefficients are known from manufacturer data. Since electrical efficiency of a prime mover is taken to be the electrical power output divided by the gas heat input, the expression for the natural gas heat input is:

$G_{Gen} = G^*_{Gen} \cdot \frac{E_{Gen}}{E^*_{Gen}} \cdot \frac{1}{y_{Gen}}$ (7.35a)

$G_{Gen} = G^*_{Gen} \cdot x_{Gen} \cdot \frac{1}{(a_0 + a_1 x_{Gen} + a_2 x_{Gen}^2)}$ (7.35b)
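Equations 7.33 and 7.35b translate directly into two small functions, sketched below. The coefficient values and the rated gas input are hypothetical placeholders (real values would come from manufacturer data); they are chosen so that the relative efficiency equals 1 at rated output.

```python
def relative_efficiency(x_gen, a0, a1, a2):
    """Relative electrical efficiency y_Gen of Eq. 7.33."""
    return a0 + a1 * x_gen + a2 * x_gen ** 2


def gas_heat_input(x_gen, g_rated, a0, a1, a2):
    """Natural gas heat input G_Gen of Eq. 7.35b."""
    return g_rated * x_gen / relative_efficiency(x_gen, a0, a1, a2)


# Hypothetical part-load coefficients (a0 + a1 + a2 = 1) and rated gas input
A0, A1, A2 = 0.4, 0.9, -0.3
G_RATED = 1000.0  # kW of gas heat at rated output

for x in (1.0, 0.75, 0.5):
    y = relative_efficiency(x, A0, A1, A2)
    g = gas_heat_input(x, G_RATED, A0, A1, A2)
    print(f"x_Gen={x:.2f}  y_Gen={y:.3f}  G_Gen={g:.1f} kW")
```

With these placeholder coefficients the gas input at half load is well above half the rated input, which reflects the degradation of electrical efficiency under part-load operation.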



The amount of waste heat which can be recovered from the primemover under part-load conditions is also needed during the simulation. Under part-load, primemover electrical efficiency degrades, and consequently a larger fraction of the supplied gas energy will appear as waste thermal heat. If one assumes that the primemover is designed such that the ratio of recovered waste heat to total waste heat is constant during its entire operation, then to a good approximation, one can model the recovered thermal energy under part-load operation $H_{Gen}$ as:



H, 



H' 



Gen 



Gen 



Gcen 



G"Gen jGe 



(7.36) 



where $y_{Gen}$ is the relative efficiency defined earlier by Eq. 7.33.

Chiller part-load performance factor (PLF) can be modeled as (Braun 2006):

$PLF = b_0 + b_1 PTR + b_2 PTR^2 + b_3 PLR + b_4 PLR^2 + b_5 PLR \cdot PTR$ (7.37)

where the numerical values of the model coefficients $b_i$ can be found from manufacturer data. Since the types of power input to the vapor compression and absorption chillers are different, the PLF for vapor compression and absorption chillers are defined as:

$PLF_{VC} = y_{VC} = \frac{E_{VC}}{E^*_{VC}}$ and $PLF_{AC} = y_{AC} = \frac{H_{AC}}{H^*_{AC}}$ (7.38)

where $E_{VC}$ is the electric power consumed by the vapor compression chiller, and $H_{AC}$ is the thermal heat input to the absorption chiller.

Instead of using the symbol PLR, which is the part-load ratio = (actual thermal cooling load / rated thermal cooling load), the symbols $x_{VC}$ and $x_{AC}$ can be used to denote the part load ratios of the vapor compression and absorption chillers respectively:

$x_{VC}\ (\text{or } x_{AC}) = \frac{Q_{VC}}{Q^*_{VC}} \left(\text{or } \frac{Q_{AC}}{Q^*_{AC}}\right)$ (7.39)



Finally, PTR is the part-load temperature ratio of the entering 

T 

condenser water temperature = — !— !- 

The boiler efficiency is also conveniently modeled using polynomial relations:

$y_{BP} = c_0 + c_1 x_{BP} + c_2 x_{BP}^2$ (7.40a)

where

$x_{BP}$ is the part load ratio = boiler heat output divided by its rated value

$x_{BP} = \frac{Q_{BP}}{Q^*_{BP}}$ (7.40b)

$y_{BP}$ is the heat input ratio, or the ratio of fuel energy input to heat output under operating condition to that under rated condition

$y_{BP} = \frac{G_{BP}/Q_{BP}}{G^*_{BP}/Q^*_{BP}}$ (7.40c)

Rearranging, one gets

$G_{BP} = G^*_{BP} \cdot \left(\frac{Q_{BP}}{Q^*_{BP}}\right)(c_0 + c_1 x_{BP} + c_2 x_{BP}^2)$ (7.41)

The optimization for a given period (say, an hour) would involve determining the numerical values of the four part load ratios (Eq. 7.32) which minimize the cost of operation while meeting the stated constraints. The four optimal part load ratios would allow $E^*_{Gen}$, $Q^*_{VC}$, $Q^*_{AC}$ and $Q^*_{BP}$ to be determined, from which $E_{Purchase}$, $G_{Gen}$ and $G_{BP}$ can be deduced to finally yield $J^*$. From a practical viewpoint, the BCHP plant has to be optimally controlled over a time horizon (which could be a day, or several hours during a day) and not just over a given period. This requires that the optimization be redone for each time period in order to achieve optimal control over the entire time horizon. In practice, there are a number of operational constraints in equipment scheduling (start-up losses, standby operation,...) which make the problem more complex than the simple static optimization approach described above (Reddy and Maor 2009).



7.9 Global Optimization 

Certain types of optimization problems can have local (or sub-optimal) minima, and the optimization methods described earlier can converge to the local minimum closest to the starting point and never find the global solution. Further, certain optimization problems can have non-continuous first-order derivatives in certain search regions, and calculus based methods break down. Global optimization methods are those which can circumvent such limitations, but can only guarantee a close approximation to the global optimum (often this is not a major issue). Unfortunately, they generally require large computation times. These methods fall under two general categories (Edgar et al. 2001):

(a) Exact methods, which include such methods as the branch-and-bound methods and multistart methods. Most commercial non-linear optimization software programs have the multistart capability built-in, whereby the search for the optimum solution is done automatically from many starting points. This is a conceptually simple approach, though its efficient implementation requires robust methods of sampling the search space for starting points that do not converge to the same local optima, and also rules for stopping the search process;

(b) Heuristic search methods are those which rely on some rules of thumb or "heuristics" to gradually reach an optimum, i.e. an iterative or adaptive improvement algorithm is central. They incorporate algorithms which circumvent the situation of non-improving moves and disallow previously visited states from being revisited. Again, there is no guarantee that a global optimum will be reached, and so often the computation stops after a certain number of computations have been completed. There are three well-known methods which fall in this category: Tabu search, simulated annealing (SA), and genetic algorithms (GA). Because of its increasing use in recent years, the last method is briefly described below.
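The multistart idea in (a) can be illustrated with a deliberately small sketch. The test function, the fixed-step gradient descent used as the local search, and the evenly spaced starting points are all assumptions made for illustration; commercial codes use far more refined sampling and stopping rules.

```python
def f(x):
    """Test function with a local minimum near 1.13 and a global one near -1.30."""
    return x ** 4 - 3 * x ** 2 + x


def grad(x):
    """Analytical first derivative of f."""
    return 4 * x ** 3 - 6 * x + 1


def local_search(x0, step=0.01, iterations=5000):
    """Plain gradient descent; converges to the minimum nearest the start."""
    x = x0
    for _ in range(iterations):
        x -= step * grad(x)
    return x


def multistart(starts):
    """Run the local search from every start and keep the lowest minimum."""
    candidates = [local_search(x0) for x0 in starts]
    return min(candidates, key=f)


single = local_search(2.0)                     # trapped in the local minimum
best = multistart([-2.0, -1.0, 0.0, 1.0, 2.0])
print(round(single, 4), round(f(single), 4))
print(round(best, 4), round(f(best), 4))
```

A single search started at x = 2 stops at the local minimum, while the multistart run also explores the other basin and returns the lower of the two minima.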

While Tabu search and simulated annealing operate by transforming a single solution at a given step, GA works with a set of solutions P(x), called a population, consisting of an array of individual members x, also called chromosomes, which are defined by a certain number of parameter values "p" (Burmeister 1998). This p-dimension problem is to be minimized with variables which could be binary or continuous. The GA algorithm is meant to work with: (i) unconstrained problems (constrained problems need to be reframed, for example, by adopting a penalty factor approach), and (ii) binary variables (a continuous variable can be converted to higher order binary variables by discretizing its range of variability into m ranges and defining m new binary variables to replace the one continuous one). An initial population of size 2n to 4n starting vectors (or chromosomes or strings) is selected (heuristically) as starting values. An objective function or fitness function to be minimized is computed for each initial or parent string, and a subset of the strings which are "fitter", i.e., which yield a lower numerical value of the objective function to be minimized, is retained. Successive iterations (or generations) are performed by either combining two or more fit individuals (called crossover) or by changing an individual (called mutation) in an effort to gradually minimize the function. This procedure is repeated several thousand times until the solution converges. Because the random search was inspired by the process of natural selection underlying the evolution of natural organisms, this optimization method is called genetic algorithm. Clearly, the process is extremely computer intensive, especially when continuous variables are involved. Sophisticated commercial software is available, but the proper use of this method requires some understanding of the mathematical basis, and of the tradeoffs available to speed convergence.
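The sequence of steps just described (binary strings, retention of the fitter subset, crossover, mutation) can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production algorithm: a single variable on [0, 10] is encoded with 12 bits, the fitness function (x − 3)² is arbitrary, the fitter half of the population is retained, and the best string is copied unchanged into the next generation. The population size, mutation rate and number of generations are arbitrary choices.

```python
import random

BITS = 12
LOW, HIGH = 0.0, 10.0


def decode(bits):
    """Map a binary string (list of 0/1) to a value in [LOW, HIGH]."""
    value = int("".join(map(str, bits)), 2)
    return LOW + (HIGH - LOW) * value / (2 ** BITS - 1)


def fitness(bits):
    """Objective to be minimized (arbitrary test function)."""
    return (decode(bits) - 3.0) ** 2


def run_ga(pop_size=20, generations=60, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(BITS)] for _ in range(pop_size)]
    history = []  # best fitness found in each generation
    for _ in range(generations):
        pop.sort(key=fitness)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]   # retain the fitter half
        children = [parents[0][:]]       # carry the best string over unchanged
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randint(1, BITS - 1)          # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]                # bit-flip mutation
            children.append(child)
        pop = children
    best = min(pop, key=fitness)
    history.append(fitness(best))
    return decode(best), history


x_best, history = run_ga()
print(x_best, history[0], history[-1])
```

Because the best string is always carried over, the best fitness can never get worse from one generation to the next; how close the final value comes to the true minimum depends on the random seed and on the settings chosen.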



7.10 Dynamic Programming 

Dynamic programming is a recursive technique developed to handle a type of problem where one is optimizing a trajectory rather than finding an optimum point. The term "dynamic" is used to reflect the fact that subsequent choices are affected by earlier ones. Thus, it involves multi-stage decision making of discrete processes or continuous functions which can be approximated or decomposed into stages. It is not a simple extension of the "static" or single-stage optimization methods discussed earlier, but one which, as shown below, involves solution methods that are much more computationally efficient. Instead of solving the entire problem at once, the sub-problems associated with individual stages are solved one after the other. The stages could be time intervals or spatial intervals. For example, determining the optimal flight path of a commercial airliner travelling from city A to city B which minimizes fuel consumption while taking into consideration vertical air density gradients (and hence, drag effects), atmospheric disturbances and other effects is a problem in dynamic programming.

Consider the following classic example of a travelling salesman, which has been simplified for easier conceptual understanding (see Fig. 7.13). A salesman starts from city A and needs to end his journey at city D, but he is also required to visit two intermediate cities B and C of his choosing among several possibilities (in this problem, three possibilities: B1, B2 and B3 at stage B; and C1, C2 and C3 at stage C). The travel costs to each of the cities at a given stage, from each of the cities of the previous stage, are specified. Thus, this problem consists of four stages (A, B, C and D) and three states (three different possible cities). The computational algorithm involves starting from the destination (city D) and working backwards to starting city A (see Table 7.1). The first calculation step involves adding the costs involved to travel from city D to cities C1, C2 and C3. The second calculation step involves determining costs from D through each of the cities C1, C2 and C3 and on to the three possibilities at stage B. One then identifies the paths through each of the cities C1, C2 and C3 which are the minimum (shown with an asterisk). Thus, path D-C1-B2 is cheaper than paths D-C1-B1 and D-C1-B3. For the third and final step, one limits the calculation to these intermediate sub-optimal paths and performs three calculations only. The least cost path among these three is the optimal path sought (shown as path D-C3-B1-A in Table 7.1). Note that one does not need to compute the other six possible paths at the third stage, which is where the computational savings arise.

Fig. 7.13 Flow paths for the traveling salesman problem, in which the salesman starts from city A and needs to reach city D with the requirement that he visit one city among the three options under each of groups B and C



Table 7.1 Solution approach to the travelling salesman problem. Each calculation step is associated with a stage and involves determining the cumulative cost till that stage is reached

| Start | First calculation step | Second calculation step | Third calculation step | Optimal path |
|-------|------------------------|-------------------------|------------------------|--------------|
| D     | D-C1                   | D-C1-B1                 |                        |              |
|       |                        | D-C1-B2*                | D-C1-B2-A              |              |
|       |                        | D-C1-B3                 |                        |              |
|       | D-C2                   | D-C2-B1                 |                        |              |
|       |                        | D-C2-B2*                | D-C2-B2-A              |              |
|       |                        | D-C2-B3                 |                        |              |
|       | D-C3                   | D-C3-B1*                | D-C3-B1-A*             | <-           |
|       |                        | D-C3-B2                 |                        |              |
|       |                        | D-C3-B3                 |                        |              |

Note: The cells with a * are the optimal paths at each step

It is obvious that the computational savings increase for problems with increasing number of stages and states. If all possible combinations were considered for a problem involving n intermediate stages (excluding the start and end stages) with m states each, the total number of enumerations or possibilities would be about $m^n$. On the other hand, for the dynamic programming algorithm described above, the total number would be approximately $n(m \times n)$. Thus, for n = m = 4, all possible routes would require 256 calculations as against about 64 for the dynamic programming algorithm.
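A backward recursion of this kind is sketched below for the two-intermediate-stage problem. The travel costs are hypothetical, since the values of Fig. 7.13 are not reproduced here. The sketch uses the common textbook form of the recursion, in which the cheapest remaining cost to D is stored for every city of a stage, and it checks the result against a brute-force enumeration of all nine routes.

```python
from itertools import product

# Hypothetical travel costs (illustrative only)
COST_A_TO_B = {"B1": 4, "B2": 6, "B3": 5}
COST_B_TO_C = {
    ("B1", "C1"): 7, ("B1", "C2"): 5, ("B1", "C3"): 3,
    ("B2", "C1"): 2, ("B2", "C2"): 4, ("B2", "C3"): 6,
    ("B3", "C1"): 5, ("B3", "C2"): 8, ("B3", "C3"): 7,
}
COST_C_TO_D = {"C1": 6, "C2": 5, "C3": 4}


def dynamic_programming():
    """Work backwards from D, keeping the cheapest cost-to-go per city."""
    to_go_c = dict(COST_C_TO_D)                       # stage C -> D
    to_go_b = {}                                      # stage B -> D
    for b in COST_A_TO_B:
        c_best = min(to_go_c, key=lambda c: COST_B_TO_C[(b, c)] + to_go_c[c])
        to_go_b[b] = (COST_B_TO_C[(b, c_best)] + to_go_c[c_best], c_best)
    b_best = min(COST_A_TO_B, key=lambda b: COST_A_TO_B[b] + to_go_b[b][0])
    total = COST_A_TO_B[b_best] + to_go_b[b_best][0]
    return total, ("A", b_best, to_go_b[b_best][1], "D")


def brute_force():
    """Enumerate every A-B-C-D route for comparison."""
    return min(
        COST_A_TO_B[b] + COST_B_TO_C[(b, c)] + COST_C_TO_D[c]
        for b, c in product(COST_A_TO_B, COST_C_TO_D)
    )


dp_cost, dp_path = dynamic_programming()
print(dp_cost, dp_path, brute_force())
```

The recursion never re-evaluates a complete route; each stage only extends the stored cost-to-go values by one leg, which is the source of the savings quantified above.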

Basic features which characterize the dynamic programming problem are (Hillier and Lieberman 2001):

(a) the problem can be divided into stages with a policy decision made at each stage,

(b) each stage has a number of states associated with the beginning of that stage,

(c) the effect of the policy decision at each stage is to transform the current state to a state associated with the beginning of the next stage,

(d) the solution procedure is to divide the problem into stages, and given the current state at a certain stage, to find the optimal policy for the next stage only among all future states,

(e) the optimal policy of the remaining stages is independent of the optimal policies selected in the previous stages.

Dynamic programming has been applied to a large number of problems; to name a few: control of system operation over time, design of equipment involving multistage equipment (such as heat exchangers, reactors, distillation columns,...), equipment maintenance and replacement policy, production control, economic planning, and investment. One can distinguish between deterministic and probabilistic dynamic programming methods, where the distinction arises when the next stage is not completely determined by the state and policy decisions of the current stage. Only deterministic problems are considered in this section. Examples of stochastic factors which may arise in probabilistic problems could be uncertainty in future demand, random equipment failures, supply of raw material, ...

Example 7.10.1:* Example of determining optimal equipment maintenance policy.

Consider a facility which generates electricity from a prime-mover such as a reciprocating engine or a gas-turbine. The electricity is sold to the electric grid, which results in an income to the owner. However, the efficiency of the prime-mover gradually degrades over time so that the net income generated from electricity sales reduces over time. A major service costing $21 (thousand) is required to bring the primemover up to its original performance, and this is needed at least once every 5 years. The net sales income (in thousands of dollars/year) is given by the following equation:

$S = 25 - n^2$ for $0 \le n \le 4$ (7.42)

where n is the number of years from last major service. Hence, service is mandatory at the end of 5 years, but service may be required more frequently to maximize profits. The time value of money is not considered in this simple example.

The intent of the problem is to determine the maintenance schedule of this equipment, at a time when it is two years old, which maximizes cumulative net profit over a 4 year period. The annual profit P(n) is given by:

- at the end of the year when no servicing is done:

$P(n) = 25 - n^2$ for $0 \le n \le 4$ (7.43)

- at the end of the year when servicing is done:

$P(n = 0) = 25 - 0 - 21 = 4$
It is clear that one is seeking to maximize a sum, namely $J = \sum_{i=1}^{N} P_i(n)$, where N = 4 (the time horizon for this problem) and i is the index representing the time period into this time horizon. The index i should not be confused with the index n. This is clearly a multi-stage problem with numerous combinations possible, and well suited to the dynamic programming approach. At the end of each year, one is required to make a decision whether to continue as is or whether to have service performed. Recalling one of the basic features of the dynamic programming approach, namely that the optimal policy for a certain stage is independent of the optimal policies selected in the previous stages, one can frame the problem as a recursive equation for the optimal cumulative sum $J^*$:

$J_i^*(n_i) = \max\{P_i(n_i) + J_{i-1}^*(n_{i-1})\}$ (7.44)

where $n_i$ denotes the number of years after the last service corresponding to the current stage (or year) in the time horizon.

The procedure and the results of the calculation are shown in Table 7.2. Note how the calculation starts from the second column and recursively fills the subsequent cells. For example, consider the case when one is 3 years after the last service was done and there are 2 years of operation left. Then:

$J_2^*(3) = \max[\{P(3) + J_1^*(4)\}, \{P(0) + J_1^*(1)\}]$

where the first term applies if no service is done and the second if service is done,

or $J_2^*(3) = \max[\{16 + 9\}, \{4 + 24\}] = 28$



* From Beveridge and Schechter (1970) by permission of McGraw- 
Hill. 



This value is the one shown in the table with the indication that performing service is the preferred option.

Table 7.2 Table showing net cumulative profit ($ thousands). The bolded numbers and the arrows indicate the optimal annual policy decisions over a time horizon of 4 years of operation at a time 2 years after the last service to the equipment was performed

| Age of unit from time of service, n (yr) | 1 year of operation left | 2 years left | 3 years left | 4 years left |
|---|---|---|---|---|
| 1 | Max(25-1², 4) = 24 (N) | Max(24+21, 4+24) = 45 (N) | Max(45+16, 4+24+21) = 61 (N) | Max(61+9, 4+24+21+16) = 70 (N) |
| 2 | Max(25-2², 4) = 21 (N) | Max(21+16, 4+24) = 37 (N) | Max(37+9, 4+24+21) = 49 (N or S; N selected) | Max(49+4, 4+24+21+16) = 65 (S) |
| 3 | Max(25-3², 4) = 16 (N) | Max(16+9, 4+24) = 28 (S) | Max(28+4, 4+24+21) = 49 (S) | Max(49+4, 4+24+21+16) = 65 (S) |
| 4 | Max(25-4², 4) = 9 (N) | Max(9+4, 4+24) = 28 (S) | Max(28+4, 4+24+21) = 49 (S) | Max(49+4, 4+24+21+16) = 65 (S) |
| 5 | 25-0-21 = 4 (S) | Max(4+21, 4+24) = 28 (S) | Max(28+4, 4+24+21) = 49 (S) | Max(49+4, 4+24+21+16) = 65 (S) |

Note: N = no service, S = service done



The optimal policy decisions (whether to perform service or not) at each of the 4 years of the time horizon are shown bolded in Table 7.2. The path indicated by arrows is the optimal one. As the equipment is already 2 years old, one starts from the extreme right column at the second row. Since service has been done, one moves up to the third column since the age from time of service has now reduced to one. The next step should end in the cell corresponding to age of unit = 2 since no service is done at this stage (year 3). Similarly, the last step takes one down to the cell corresponding to age of unit = 3 since no service is again done at year 2.

This example is rather simplified and was meant to illustrate how the dynamic programming algorithm could be used to address equipment maintenance and scheduling problems. Actual cases could involve multiple equipment systems, and the inclusion of more complex engineering, financial and operational characteristics (such as varying cost of electricity generation by season and demand) as well as longer time horizons. It is in such more elaborate situations that the greater computational efficiency of dynamic programming becomes a major consideration. ■
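The recursion of Eq. 7.44 can be written compactly as a memoized function. The sketch below applies the same bookkeeping as the worked calculation of $J_2^*(3)$: a year without service at age n earns 25 − n² and leads to age n + 1, while a year with service earns 25 − 0 − 21 = 4 and leads to age 1. The function names are illustrative, and the sketch is checked here only against the worked value of 28 and a few entries of the first three columns of Table 7.2.

```python
from functools import lru_cache

SERVICE_COST = 21   # thousand dollars
MAX_AGE = 5         # service is mandatory once the unit is 5 years old


def annual_profit(n):
    """Net sales income of Eq. 7.42 for a unit n years after service."""
    return 25 - n ** 2


@lru_cache(maxsize=None)
def best_profit(age, years_left):
    """Return (maximum cumulative profit, decision for the current year).

    The decision is 'N' (no service) or 'S' (service done).
    """
    if years_left == 0:
        return 0, ""
    # Option 1: service this year, then continue from age 1
    serve = annual_profit(0) - SERVICE_COST + best_profit(1, years_left - 1)[0]
    if age >= MAX_AGE:
        return serve, "S"
    # Option 2: no service this year, then continue from age + 1
    no_serve = annual_profit(age) + best_profit(age + 1, years_left - 1)[0]
    return (no_serve, "N") if no_serve >= serve else (serve, "S")


print(best_profit(3, 2))   # the worked case: 3 years old, 2 years left
print([best_profit(age, 1)[0] for age in range(1, 6)])
```

Memoization ensures each (age, years left) state is evaluated only once, so the work grows with the number of states rather than with the number of possible maintenance schedules.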

Example 7.10.2: Strategy of operating a building to mini- 
mize cooling costs 

This simple example illustrates the use of dynamic programming for
problems involving differential equations. The concept of lumped
models was introduced in Sect. 1.2.3 and a thermal network model of
heat flow through a wall was discussed in Fig. 1.5. The same concept
of a thermal network can be extended to predict the dynamic thermal
response of an entire building.

Many electric utilities in the U.S. have summer-peaking problems,
meaning that they are hard pressed to meet the demands of their
service customers during certain hot afternoon periods, referred to as
peak periods. Air-conditioning (AC) use in residences and commercial
buildings has been shown to be largely responsible for this situation.
Remedial solutions undertaken by utilities involve voluntary
curtailment by customers via incentives or penalties through electric
rates which vary over time of day and by season (called time-of-day
seasonal rates). Engineering solutions, also encouraged by utilities,
to alleviate this situation include installing cool (ice) storage
systems, as well as soft options involving dynamic control of the
indoor temperature via the thermostat. This is achieved by sub-cooling
the building during the night and early morning, and then adjusting
the thermostat set point in a controlled manner during the peak period
hours such that the "coolth" in the thermal mass of the building
structure and its furnishings can partially offset the heat loads of
the building and, hence, reduce the electricity demand of the AC.

Figure 7.14 illustrates a common situation where the building is
occupied from 6:00 am till 7:00 pm with the peak period being from
noon till 6:00 pm. The normal operation of the building is to set the
thermostat to 72°F during the occupied period and to 85°F during the
unoccupied period (such a thermostat set-up scheme is a common energy
conservation measure). Three different pre-cooling options are shown,
all three requiring the building to be cooled down to 70°F,
representative of the lower occupant comfort level, starting from
6:00 am. The difference between the three options lies in how the
thermostat is controlled during the peak period. The first scheme is
to simply set up the thermostat to a value of 78°F, representative of
the high-end occupant discomfort value, with the anticipation that the
internal temperature T_i will not reach this value before the end of
the peak period. If it does, the AC would come on and partially negate
the electricity demand benefits which such a control scheme would
provide. Often, the thermal mass in the structure will not be high
enough for this control scheme to work satisfactorily. Another simple
control scheme is to let the indoor temperature ramp up linearly,
which is also not optimal. The third, and optimal, scheme, which
minimizes the following cost function over the entire day, can be
determined by solving the dynamic programming problem for this
situation over time t:

\[ J^* = \min\{J\} = \min\left[ \sum_{t} c_{e,t}\, P_t(T_{i,t}) + c_d \max(P_t)\big|_{t_1,t_2} \right] \tag{7.45a} \]

subject to:

\[ T_{i,\min} \le T_{i,t} \le T_{i,\max} \quad \text{and} \quad 0 \le P_t \le P_{Rated} \tag{7.45b} \]

Note: This is a continuous path optimization problem which can be
discretized to a finite sum, say over a convenient time period of 1 h
(the approximation improves as the number of terms in the sum
increases).

Fig. 7.14 Sketch showing the various operating periods of the building
discussed in Example 7.10.2, and the three different thermostat
set-point control strategies for minimizing total electric cost.
a Normal operation, b Pre-cooling strategies (From Lee and Braun 2008
© American Society of Heating, Refrigerating and Air-Conditioning
Engineers, Inc., www.ashrae.org)

where
c_{e,t} is the unit cost vector of electricity in $/kWh (which can
    assume different values at different times of the day) as set by
    the electric utility,
P_t is the electric energy use during hour t, and is a function of
    T_{i,t} which changes with time t,
c_d is the demand cost in $/kW, also set by the electric utility,
    which is usually imposed on the peak hourly use during the peak
    hours (or there could be two demand costs, one for off-peak and
    one for on-peak during a given day), and
max(P_t)|_{t1,t2} is the demand or maximum hourly use during the peak
    period t_1 to t_2.

The AC power consumed each hour, represented by P_t, is affected by
T_{i,t}. It cannot be negative and should be less than the capacity of
the AC denoted by P_Rated. The solution to this dynamic programming
problem requires two thermal models: one to represent the thermal
response of the building, and another for the performance (or
efficiency) of the AC.



Fig. 7.15 A simplified 1R1C thermal network to model the thermal
response of a building (i.e., variation of the indoor temperature T_i)
subject to heat gains from the outdoor temperature T_o and from
internal heat gains Q_g. The overall resistance and the capacitance of
the building are R and C respectively, and Q_AC is the thermal heat
load to be removed by the air-conditioner



Models of varying complexity have been proposed in the literature. A
simple model following Reddy et al. (1991) is adequate to illustrate
the approach. Consider the 1R1C thermal network shown in Fig. 7.15
where the thermal mass of the building is simplistically represented
by one capacitor C and an overall resistance R. The internal heat
gains Q_g could include both solar heat gains coming through windows
as well as thermal loads from lights, occupants and equipment
generated from within the building. The simple one-node lumped model
for this case is:

\[ C \frac{dT_i}{dt} = \frac{T_o(t) - T_i}{R} + Q_g(t) - Q_{AC}(t) \tag{7.46} \]



where T_o and T_i are the outdoor and indoor dry-bulb temperatures,
and Q_AC is the thermal cooling provided by the AC.

For the simplified case when Q_g and T_o can be assumed constant and
the AC is switched off, the transient response of this dynamic system
is given by:

\[ T_i(\Delta t) = T_{i,\min} + \left( R\,Q_g + T_o - T_{i,\min} \right) \left[ 1 - \exp\left( -\frac{\Delta t}{\tau} \right) \right] \tag{7.47a} \]

where Δt is the time from when the AC is switched off. The time
required for T_i to increase from T_{i,min} to T_{i,max} is then:

\[ \Delta t = -\tau \ln\left[ 1 - \frac{T_{i,\max} - T_{i,\min}}{R\,Q_g + T_o - T_{i,\min}} \right] \tag{7.47b} \]

where τ is the time constant given by (C.R).
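
The free-floating response of the 1R1C model can be checked
numerically. The following Python sketch evaluates the exponential
rise of the indoor temperature after the AC is switched off and the
time needed to drift from the lower to the upper comfort limit. The
function names are hypothetical, and the numerical values (R = 2.5
°C/kW, Q_g = 1.5 kW, time constant of 6 h, T_o = 32°C, comfort band of
22 to 28°C) are borrowed from Pr. 7.9 purely for illustration.

```python
import math


def indoor_temp(dt_h, t_min, t_out, r, q_gain, tau_h):
    """Indoor temperature dt_h hours after the AC is switched off.

    Assumes constant outdoor temperature and internal gains, so the
    temperature relaxes exponentially toward t_out + r * q_gain.
    """
    t_steady = t_out + r * q_gain
    return t_min + (t_steady - t_min) * (1.0 - math.exp(-dt_h / tau_h))


def hours_to_reach(t_max, t_min, t_out, r, q_gain, tau_h):
    """Time (h) for the indoor temperature to drift from t_min to t_max."""
    t_steady = t_out + r * q_gain
    if t_max >= t_steady:
        return math.inf  # the building never gets that warm on its own
    return -tau_h * math.log(1.0 - (t_max - t_min) / (t_steady - t_min))


# Illustrative values (assumed, taken from Pr. 7.9)
dt = hours_to_reach(t_max=28.0, t_min=22.0, t_out=32.0,
                    r=2.5, q_gain=1.5, tau_h=6.0)
print(f"Hours for T_i to rise from 22 to 28 degC: {dt:.2f}")
```

Substituting the returned time back into `indoor_temp` recovers the
upper limit, which is a quick consistency check between the two
expressions.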

Table 7.3 Peak AC power reduction and daily energy savings compared to
base operation under different control strategies for similar days in
October. (From Lee and Braun 2008 © American Society of Heating,
Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org)

Test #  T_out,max  Control        Peak power  Peak savings  Energy use  Energy savings
        (°C)       strategy*      (kW)        (kW)          (kWh)       (kWh)
1       32.2       NS (baseline)  26.10       -             243.1       -
2       31.7       LR             23.53       2.57          226.5       16.6
3       32.8       SU             20.52       5.58          194.2       20.1
4       29.4       NS (baseline)  29.70       -             224.3       -
5       30.6       DL             20.03       9.67          219.2       5.2
6       30.6       DL             22.34       7.36          196.6       27.8
7       26.7       NS (baseline)  27.04       -             233.4       -
8       26.7       DL             16.94       10.10         190.4       43.0

*NS baseline operation, LR linear ramp-up, SU setup, DL demand limiting

The thermal cooling energy ΔQ_AC which can be avoided by the linear
ramp-up strategy can also be determined in a straightforward manner by
performing hourly calculations over the peak period using Eq. 7.46
since the T_i values can be determined in advance. The total thermal
cooling energy saved during the peak period is easily deduced as:



\[ \Delta Q_{AC} = \frac{C\,(T_{i,\max} - T_{i,\min})}{\Delta t_{peak}} \tag{7.48} \]

where Δt_peak is the duration of the peak period.

For the dynamic optimal control strategy, the rise in T_i(t) over the
peak period has to be determined. For the sake of simplification, let
us discretize the continuous path into, say, hourly increments. Then,
Eq. 7.46, with some re-arrangement and minor notational change, can be
expressed as:

\[ Q_{AC,t} = -C\,T_{i,t+1} + T_{i,t}\left( C - \frac{1}{R} \right) + \frac{T_{o,t}}{R} + Q_{g,t} \tag{7.49a} \]



while the electric power drawn by the AC can be modeled as:

\[ P_t = f(Q_{AC}, T_i, T_o) \tag{7.49b} \]

subject to conditions Eq. 7.45b.

The above problem can be solved by framing it as one with several
stages (each stage corresponding to an hour into the peak period) and
states representing the discretized set of possible values of T_i
(say in steps of 0.5°F). One would get a set of simultaneous equations
(the order being equal to the number of stages) which could be solved
together to yield the optimal trajectory. Though this is conceptually
appealing, it would be simpler to perform the computation using a
software package given the non-linear function for P_t (Eq. 7.49b) and
the need to introduce constraints on P_t and T_i.

The example in Sect. 7.8 assumed polynomial models though physical
models using grey box models (see Pr. 5.13) are equally appropriate.
Table 7.3 assembles peak AC power savings and daily energy savings for
a small test building located in Palm Desert, CA which was modeled
using higher order differential equations by Lee and Braun (2008). The
model parameters have been deduced from experimental testing of the
building and the AC equipment which were then used to evaluate
different thermostat control options. The table assembles daily energy
use and peak AC data for baseline operation (NS) against which other
schemes can be compared. Since the energy and demand reductions would
depend on the outdoor temperature, the tests have been assembled for
three different conditions (tests 1-3 for very hot days, tests 4-6 for
milder days, and tests 7-8 for even milder days). The optimal strategy
found by dynamic programming is clearly advantageous both in demand
reduction and in diurnal energy savings although the benefits show a
certain amount of variability from day to day. This could be because
of diurnal differences in the driving functions and also because of
uncertainty in the determination of the model parameters. ■
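
The stage-and-state formulation described in this example can be
sketched in a few lines of Python. The sketch below is only an
illustration under loud simplifying assumptions, not the procedure of
Lee and Braun: the building is the 1R1C model with a 1 h time step,
the AC has a constant COP (standing in for the non-linear power
function), only the energy charge is minimized (the demand charge is
omitted because a maximum over hours is not additive across stages),
and all numerical values, prices and function names are hypothetical.

```python
# All numerical values below are assumed for illustration only.
R = 2.5         # overall thermal resistance, degC/kW
C = 2.4         # thermal capacitance, kWh/degC (time constant C*R = 6 h)
Q_GAIN = 1.5    # internal heat gains Q_g, kW
T_OUT = 32.0    # outdoor temperature T_o, degC
Q_RATED = 4.0   # maximum thermal cooling the AC can deliver, kW
COP = 4.0       # constant COP, a stand-in for a non-linear power model
T_MIN, T_MAX, STEP = 22.0, 28.0, 0.5   # comfort band and state grid, degC


def cooling_load(t_now, t_next):
    """AC thermal load (kW) that moves T_i from t_now to t_next in 1 h."""
    return -C * t_next + t_now * (C - 1.0 / R) + T_OUT / R + Q_GAIN


def optimal_trajectory(prices, t_start=T_MIN):
    """Backward dynamic programming over hourly stages.

    prices: electricity price ($/kWh) for each hour of the peak period.
    Returns (list of indoor temperatures, total energy cost).
    """
    n_states = int(round((T_MAX - T_MIN) / STEP)) + 1
    temps = [T_MIN + STEP * k for k in range(n_states)]
    inf = float("inf")
    value = [0.0] * n_states          # cost-to-go after the final stage
    policy = []                       # policy[t][i] = best next state index
    for price in reversed(prices):
        new_value = [inf] * n_states
        best_next = [None] * n_states
        for i, t_now in enumerate(temps):
            for j, t_next in enumerate(temps):
                q_ac = cooling_load(t_now, t_next)
                if q_ac < -1e-9 or q_ac > Q_RATED + 1e-9:
                    continue          # violates 0 <= Q_AC <= capacity
                cost = price * max(q_ac, 0.0) / COP + value[j]
                if cost < new_value[i]:
                    new_value[i], best_next[i] = cost, j
        value = new_value
        policy.insert(0, best_next)
    i = int(round((t_start - T_MIN) / STEP))
    total_cost = value[i]
    path = [temps[i]]
    for best_next in policy:          # forward pass to recover the path
        i = best_next[i]
        if i is None:
            raise ValueError("no feasible trajectory from this start state")
        path.append(temps[i])
    return path, total_cost


prices = [0.10, 0.10, 0.20, 0.20, 0.20, 0.10]   # assumed $/kWh, six peak hours
path, cost = optimal_trajectory(prices)
print(path, round(cost, 3))
```

Including the demand charge would require carrying the running peak
power as an additional state variable, which is one reason the text
suggests using a software package for the full problem.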



Problems 

Pr. 7.1 (from Reddy 1987) The collector area A_c of a solar thermal
system is to be determined which maximizes the total discounted
savings C' over n years. The solar system delivers thermal energy to
an industrial process with the deficit being met by a conventional
boiler system (this configuration is referred to as a
solar-supplemented thermal system). Given the following expression for
discounted savings:



C' =87,875.66 



17,738.08 



1 — exp 



190.45 



- 2000 - 300Ac 



determine the optimal value of collector area A_c. Verify your
solution graphically, and estimate a satisficing range of collector
area values which are within 5% of the optimal value of C'.



From Reddy (1987). 



Fig. 7.16 Sketch of a parabolic trough solar power system. The energy
collected from the collectors can be stored in the thermal storage
tanks and is used to produce steam to operate a Rankine power engine.
Components labeled in the sketch: parabolic troughs, receiver,
turbine, generator and steam condenser. (Downloaded from
http://www1.eere.energy.gov/solar/)



Pr. 7.2

(a) Use a graphical approach to determine the values of x_1 and x_2
which maximize the following function:

\[ \begin{aligned} J &= 0.4x_1 + 0.5x_2 \\ \text{s.t.}\quad & 0.3x_1 + 0.1x_2 \le 2.7 \\ & 0.5x_1 + 0.5x_2 = 6 \\ & 0.6x_1 + 0.4x_2 \ge 6 \\ & x_1 \ge 0 \ \text{and}\ x_2 \ge 0 \end{aligned} \]

(b) Solve the problem analytically using the slack variable approach,
and verify your results.

(c) Perform a sensitivity analysis of the optimum.

Pr. 7.3 Use the Lagrange multiplier approach to solve the following
constrained minimization problem:

\[ y = (x_1^2 + x_2^2 + x_3^2)/2 \qquad \text{s.t.}\quad x_1 - x_2 = 0, \quad x_1 + x_2 + x_3 = 0 \]

Pr. 7.4 Solar thermal power system optimization
Generating electricity via thermal energy collected from solar
collectors is a mature technology which is much cheaper than that from
photovoltaic systems (if no rebates and financial incentives are
considered). Such solar thermal power systems in essence comprise a
solar collector field, a thermal storage, a heat exchanger to transfer
the heat collected from the solar collectors to a steam boiler, and
the conventional Rankine steam power plant (Fig. 7.16). A simpler
system without storage tank will be analyzed here such that the fluid
temperature leaving the solar collector array (T_co) will directly
enter the Rankine engine to produce electricity.

Problem 5.6 described the performance model of a flat-plate solar
collector, while usually concentrating collectors are required for
power generation. A simplified expression analogous to Eq. 5.65 for
concentrating collectors with concentration ratio C is:

\[ \eta_c = F_R\,\eta_{opt} - \frac{F_R U_L}{C} \left( \frac{T_{ci} - T_a}{I_T} \right) \]

The efficiency of the solar system decreases with higher values of
T_ci while that of the Rankine power cycle increases with higher
values of T_co, such that the optimal operating point is dictated by
the product of both efficiency curves (Fig. 7.17). The problem is to
determine the optimal value of T_ci which maximizes the overall system
efficiency given the following information:

(i) concentration ratio C = 30
(ii) beam solar irradiation I_T = 800 W/m²
(iii) trough collectors with: F_R η_opt = 0.75 and F_R U_L = 10.0 W/m² °C
(iv) ambient temperature T_a = 20°C
(v) Assume that the Rankine efficiency of the steam power cycle is
half of that of the Carnot cycle operating between the same high and
low temperatures (take the low-end temperature to be 10°C above T_a;
the high temperature is to be determined from an energy balance on the
solar collector: T_co = T_ci + Q_c/(m c_p), where Q_c is the useful
energy collected per unit collector area (W/m²), the mass flow rate
per unit collector area m = 8 kg/m² h, and the specific heat of the
heat transfer fluid c_p = 3.26 kJ/kg °C).

Solve this unconstrained problem analytically and verify your result
graphically. Perform a post-optimality analysis and identify
influential variables and parameters in this problem.

Fig. 7.17 Combined solar collector and Rankine engine efficiencies
dictate the optimum operating temperature of the system

Fig. 7.18 Ducting layout with pertinent information for Problem
Pr. 7.5

Pr. 7.5 Minimizing pressure drop in ducts (from Stoecker 1989)
Using the method of Lagrange multipliers, determine the diameters of
the circular duct in the system shown in Fig. 7.18 so that the drop in
the static pressure between points A and B will be a minimum. Use the
following additional information:

- Quantity of sheet metal available = 60 m²

- Pressure drop in a section of straight duct of diameter D (m) and
  length L (m) with fluid flowing at velocity v (m/s):
  Δp = f (L/D) (ρ v²/2), where f is the friction factor = 0.02 and
  air density ρ = 1.2 kg/m³

Neglect pressure drop in the straight-through section past the outlets
and the influence of changes in velocity pressure. Use pertinent
information from Fig. 7.18.

Pr. 7.6 Replacement of filters in HVAC ducts
The HVAC air supply in hospitals has to meet high standards in terms
of biological and dust-free cleanliness, for which purpose high
quality filters are used in the air ducts. Fouling by way of dust
build-up on these filters causes additional pressure drop which
translates into an increased electricity consumption of the fan-motor
circulating the air. Hence, the maintenance staff is supposed to
replace these filters on a regular basis. Changing them too frequently
results in undue expense due to the high cost of these filters, while
not replacing them in a timely manner also increases the expense due
to the additional pumping power. Determine the optimal filter
replacement schedule under the following operating conditions
(neglecting time value of money):

- The HVAC system operates 24 h/day and 7 days/week and circulates
  Q = 100 m³/s of air

- The pressure drop in the HVAC duct when the filters are new is 5 cm
  of water or H = 0.05 m

- The pressure drop across the filters increases in a linear fashion
  by 0.1 m of water gauge for every 1000 h of operation (this is a
  simplification; the actual increase is likely to be exponential)

- The total cost of replacing all the filters is $800

- The efficiency of the pump is 65% and that of the motor is 90%

- The levelized cost of electricity is $0.10 per kWh.

The electric power consumed in kW by the pump-motor is given by:

\[ P(\text{kW}) = \frac{Q(\text{L/s})\, H(\text{m})}{102\, \eta_{pump}\, \eta_{motor}} \]

(Hint: The problem is better visualized by plotting the energy cost
function versus hours of operation)

Pr. 7.7 Relative loading of two chillers
Thermal performance models for chillers have been described in
Pr. 5.13 of Chap. 5. The black box model given by Eq. 5.75 for the COP
often appears in a modified form with the chiller electric power
consumption P being expressed as:

\[ P = a_0 + a_1 (T_{cdo} - T_{chi}) + a_2 (T_{cdo} - T_{chi})^2 + a_3 Q_{ch} + a_4 Q_{ch}^2 + a_5 (T_{cdo} - T_{chi})\, Q_{ch} \tag{7.50} \]

where T_cdo and T_chi are the leaving condenser water and supply
chilled water temperatures respectively, Q_ch is the chiller thermal
load, and a_i are the model parameters. Consider a situation where two
chillers, denoted by Chiller A and Chiller B, are available to meet a
thermal cooling load. The chillers are to be operated such that
T_cdo = 85°F and T_chi = 45°F. Chiller B is more efficient than
Chiller A at low relative load fractions and vice versa. Their model
coefficients from performance data supplied by the chiller
manufacturer are given in Table 7.4.

(a) The loading fraction of a chiller is the ratio of the actual
thermal load supplied by the chiller to its rated value Q_ch,rated.
Use the method of Lagrange multipliers to prove that the optimum
loading fractions y_A* and y_B* occur when the slopes of the curves
are equal, i.e., when

\[ \frac{\partial P_A}{\partial Q_{ch,A}} = \frac{\partial P_B}{\partial Q_{ch,B}} \]

(b) Determine the optimal loading (which minimizes the total power
draw) of both chillers at three different values of Q_ch: 800, 1200
and 1600 tons, and calculate the corresponding power draw. Investigate
the effect of near-optimal operation, and plot your results in a
fashion useful for the operator of this cooling plant.

Table 7.4 Values of coefficients in Eq. 7.50. (From ASHRAE 1999
© American Society of Heating, Refrigerating and Air-Conditioning
Engineers, Inc., www.ashrae.org)

              Units           Chiller A    Chiller B
Q_ch,rated    Tons (cooling)  1250         550
a_0           kW              106.4        119.7
a_1           kW/°F           6.147        0.1875
a_2           kW/°F²          0.1792       0.04789
a_3           kW/ton          -0.0735      -0.3673
a_4           kW/ton²         0.0001324    0.0005324
a_5           kW/ton.°F       -0.001009    0.008526

From Stoecker (1989) by permission of McGraw-Hill.



Pr. 7.8 Maintenance scheduling for plant (from Stoecker 1989)
The maintenance schedule for a plant is to be planned to maximize its
4-year profit. The income level of the plant at any given year is a
function of the condition of the plant carried over from the previous
year and the maintenance expenditure at the beginning of the year.
Table 7.5 shows the necessary maintenance expenditures that result in
a certain income level during the year for various income levels
carried over from the previous year. The income level at the beginning
of year 1 before the maintenance expenses are made is $36,000 and the
income level specified during, and at the end of, year 4 is to be
$34,000. The profit for any one year will be the income during the
year minus the expenditure made for maintenance at the beginning of
the year. Use dynamic programming to determine the plan for
maintenance expenditures that results in maximum profit for the
4 years.

Table 7.5 Maintenance expenditures made at the beginning of year,
thousands of dollars

Income level carried over    Income level during year
from previous year           $30    $32    $34    $36    $38    $40
$30                          $2     $4     $7     $11    $16    $23
32                           2      3      5      9      13     18
34                           1      2      4      7      10     14
36                           0      1      2      5      8      10
38                           X      0      1      2      6      9
40                           X      X      0      1      4      8

Pr. 7.9 Comparing different thermostat control strategies
The three thermostat pre-cooling strategies whereby air-conditioner
(AC) electrical energy use in commercial buildings can be reduced have
been discussed in Example 7.10.2. You will compare these three
strategies for the following small commercial building and specified
control limits:

Assume that the RC network shown in Fig. 7.15 is a good representation
of the actual building. The building time constant is 6 h and its
overall heat transfer resistance R = 2.5°C/kW. The internal loads of
the space can be assumed constant at Q_g = 1.5 kW. The peak period
lasts for 8 h and the ambient temperature can be assumed constant at
T_o = 32°C. The minimum and maximum thermostat control set points are
T_{i,min} = 22°C and T_{i,max} = 28°C.

A very simple model for the AC is to be used (Kreider et al. 2009):

\[ P_{AC} = \frac{Q_{AC,Rated}}{COP_{Rated}} \left( 0.023 + 1.429\,PLR - 0.471\,PLR^2 \right) \tag{7.51} \]

where PLR = part load ratio = (Q_AC/Q_AC,Rated) and COP_Rated is the
Coefficient of Performance of the reciprocating chiller. Assume
COP_Rated = 4.0 and Q_AC,Rated = 4.0 kW.

(a) Usually, T_{i,max} is not the set point temperature of the space.
For pre-cooling strategies to work better, a higher temperature is
often selected since occupants can tolerate this increased temperature
for a couple of hours without adverse effects. Calculate the AC
electricity consumed, as well as the demand, during the peak period if
T_i = 26°C; this will serve as the baseline electricity consumption
scenario.

(b) For the simple-minded setup strategy, first compute the number of
hours in which the selected T_{i,max} is reached starting from
T_{i,min}. Then, calculate the electricity consumed and the demand by
the AC during the remaining number of hours left in the peak period if
the space is kept at T_{i,max}.

(c) For the ramp-up strategy, calculate the electricity consumed and
the demand by the AC during the peak period (by summing those at
hourly time intervals).

(d) Assuming that the thermostat is being controlled at hourly
intervals, determine the optimal trajectory of the indoor temperature.
What is the corresponding AC electricity use and demand?

(e) Summarize your results in a table similar to Table 7.3 and discuss
your findings.
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Classification and Clustering Methods 



8 



This chapter covers two widely used classes of multivaria- 
te data analysis methods, classification and clustering met- 
hods. Classification methods are meant: (i) to statistically 
distinguish or "discriminate" between differences in two or 
more groups when one knows beforehand that such grou- 
pings exist in the data set of measurements provided, and 
(ii) subsequently assign or allocate a future unclassified ob- 
servation into a specific group with the smallest misclassi- 
fication error. Numerous classification techniques, divided 
into three groups: parametric, heuristic and regression trees, 
are described and illustrated by way of examples. Clustering 
involves situations when the number of clusters or groups is 
not known beforehand, and the intent is to allocate a set of 
observation sets into groups which are similar or "close" to 
one another with respect to certain attribute(s) or characte- 
ristic(s). In general, the number of clusters is not predefined 
and has to be gleaned from the data set. This and the fact that 
one does not have a training data set to build a model make 
clustering a much more difficult problem than classification. 
Two types of clustering techniques, namely partitional and 
hierarchical, are described. This chapter provides a non-mat- 
hematical overview of these numerous techniques with con- 
ceptual understanding enhanced by way of simple examples 
as well as actual case study examples. 



8.1 Introduction 

Certain topics relevant to multivariate data analysis have 
been previously treated: ANOVA analysis (Sect. 4.3), tests 
of significance (Sect. 4.4), multiple OLS regression without 
(Sect. 5.4) and with collinear regressors (Sect. 10.3). This 
chapter complements these by covering two important statis- 
tical classes of problems dealing with multivariate data ana- 
lysis, namely classification and clustering methods. 

Clustering analysis involves several procedures by which 
a group of samples (or multivariate observations) can be 
clustered or partitioned or separated into sub-sets of grea- 
ter homogeneity, i.e., those based on some pre-determined 
similarity criteria. Examples include clustering individuals 



based on their similarities with respect to physical attribu- 
tes or mental attitudes or medical problems; or multivariate 
performance data of a mechanical piece of equipment can 
be separated into those which represent normal operation 
as against faulty operation. Thus, clustering analysis reve- 
als inter-relationships between samples which can serve to 
group them under situations where one does not know the 
number of sub-groups beforehand. There are a large num- 
ber of methods which have been developed, and some of the 
classical ones are described in Sect. 8.5. As a note of caution, 
certain authors (for example, Chatfield 1995) opine that the 
clustering results depend to a large extent on the clustering 
method used, and that inexperienced users are very likely to 
end up with misleading results. This lack of clear-cut results 
has somewhat dampened the use of sophisticated clustering 
methods, with analysts tending to rely rather on simpler and 
more intuitive methods. 

Classification analysis, on the other hand, applies to si- 
tuations when the groups are known beforehand. The pur- 
view here is to identify models which best characterize the 
boundaries between groups, so that future objects can be all- 
ocated into the appropriate group. Since the groups are pre- 
determined, classification problems are somewhat simpler to 
analyze than clustering problems. The challenge in classifi- 
cation modeling is dealing with the misclassification rate of 
objects, a dilemma faced in most practical situations. Three types of
classification methods are briefly treated: parametric methods
(involving statistical, ordinary least squares, discriminant analysis
and Bayesian techniques), heuristic classification methods
(rule-based, decision-tree and k nearest neighbors), and parametric
models involving classification and regression trees.



8.2 Parametric Classification Approaches 

8.2.1 Distance Between Measurements 

Similarity between objects or samples can be geometrically 
likened to a distance or a trend. The correlation coefficient 
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between two variables can be used as a measure of trend and 
similarity. The distinctiveness of two objects can be visually 
ascertained by plotting them. For example, the Euclidean distance
between two objects (x_1, y_1) and (x_2, y_2) plotted on Cartesian
coordinates is characterized in two dimensions by:

\[ d = \left[ (x_2 - x_1)^2 + (y_2 - y_1)^2 \right]^{1/2} \tag{8.1} \]

In general, for p variables, the generalized Euclidean distance from
object i to object j is (Manly 2005):

\[ d_{ij} = \left[ \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right]^{1/2} \tag{8.2} \]

where x_ik is the value of the variable X_k for object i and x_jk is
the value of the same variable for object j.

The distance term will be affected by the magnitude of the variables,
i.e., physical quantities such as temperature, air flow rates and
efficiency have different scales and ranges of variation, and the one
with the largest numerical values will overwhelm the variation in the
others. Thus, some sort of normalization is warranted. One common
approach is to equalize the variances by defining a new variable
(x_j/s_j) where s_j is an estimate of the standard deviation of
variable X_j. Other ways are by min-max scaling or by standard
deviation scaling as given by Eqs. 3.12 and 3.13.

Example 8.2.1: Using distance measures for evaluating canine samples

Consider a problem where a biologist wishes to evaluate whether the
modern dog in Thailand descended from prehistoric ones from the same
region or were inbred with similar dogs which migrated from nearby
China or India. The basis of this evaluation will be six measurements
all related to the mandible or lower jaw of the dog: x_1: breadth of
mandible, x_2: height of mandible below the first molar, x_3: length
of first molar, x_4: breadth of first molar, x_5: length from first to
third molar, x_6: length from first to fourth premolar. Relevant data
are assembled in Table 8.1.

Table 8.1 Mean mandible (or lower jaw) measurements of six variables
for four canine groups. (Modified example from Higham et al. 1980)

                     x_1     x_2     x_3     x_4     x_5     x_6
Modern dog           9.7     21      19.4    7.7     32      36.5
Chinese wolf         13.5    27.3    26.8    10.6    41.9    48.1
Indian wolf          11.5    24.3    24.5    9.3     40      44.6
Prehistoric dog      10.3    22.1    19.1    8.1     32.2    35
Mean                 11.25   23.675  22.45   8.925   36.525  41.05
Standard deviation   1.68    2.78    3.81    1.31    5.17    6.31

The measurements have to be standardized, and so the mean and
standard deviation of each variable across groups are determined
(shown in Table 8.1). This allows the standardized values to be
computed following Eq. 3.13 as shown in Table 8.2. For example, the
standardized value for the modern dog: z_1 = (9.7 - 11.25)/1.68 =
-0.922, and so on.

Table 8.2 Standardized measurements

                     z_1     z_2     z_3     z_4     z_5     z_6
Modern dog           -0.922  -0.963  -0.800  -0.937  -0.875  -0.721
Chinese wolf         1.342   1.304   1.140   1.281   1.040   1.117
Indian wolf          0.149   0.225   0.537   0.287   0.672   0.562
Prehistoric dog      -0.567  -0.567  -0.878  -0.631  -0.837  -0.958

Finally, using Eq. 8.2, the Euclidean distances among all groups are
computed as shown in Table 8.3. It is clear that prehistoric dogs are
similar to modern ones because the distance between them is much
smaller than the distances between the other groups. Next in terms of
similarity are the Chinese and Indian wolves, and so on. ■

Table 8.3 Euclidean distances between canine groups

                     Modern dog  Chinese wolf  Indian wolf  Prehistoric dog
Modern dog           -
Chinese wolf         5.100       -
Indian wolf          3.145       2.094         -
Prehistoric dog      0.665       4.765         2.928        -
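
The standardization and distance calculations of this example are easy
to reproduce. The short Python sketch below uses only the standard
library and the group means of Table 8.1; the variable and function
names are hypothetical. It assumes that the standard deviations are
sample standard deviations (divisor n - 1), which is what reproduces
the values listed in Table 8.1. Small differences in the third decimal
place relative to Table 8.2 can arise because the table was computed
with rounded standard deviations.

```python
import math
from itertools import combinations
from statistics import mean, stdev

# Mean mandible measurements x_1..x_6 for each group (Table 8.1)
groups = {
    "Modern dog":      [9.7, 21.0, 19.4, 7.7, 32.0, 36.5],
    "Chinese wolf":    [13.5, 27.3, 26.8, 10.6, 41.9, 48.1],
    "Indian wolf":     [11.5, 24.3, 24.5, 9.3, 40.0, 44.6],
    "Prehistoric dog": [10.3, 22.1, 19.1, 8.1, 32.2, 35.0],
}

# Column-wise mean and sample standard deviation across the four groups
columns = list(zip(*groups.values()))
means = [mean(col) for col in columns]
stds = [stdev(col) for col in columns]

# Standardized values z = (x - mean) / std for every group and variable
z = {
    name: [(v - m) / s for v, m, s in zip(values, means, stds)]
    for name, values in groups.items()
}


def euclidean(a, b):
    """Generalized Euclidean distance between two standardized vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


distances = {
    (a, b): euclidean(z[a], z[b]) for a, b in combinations(groups, 2)
}

for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{pair[0]:16s} {pair[1]:16s} {d:6.3f}")
```

Sorting the pairs by distance lists the most similar groups first,
which is the ranking discussed in the example.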

Other measures of distance have been proposed for discrimination; one
such measure is the Manhattan measure which uses absolute differences
rather than squared distances. However, this is used only under
special circumstances. A widely used distance measure is the
Mahalanobis measure which is superior to the Euclidean when the
variables are correlated. If \bar{x}_i = (\bar{x}_{1i}, \bar{x}_{2i},
..., \bar{x}_{pi})' denotes the vector of mean values for the i-th
group, and C the variance-covariance matrix given by Eq. 5.31 in
Sect. 5.4.2, then the Mahalanobis distance from an observation x to
the center of group i is computed as:

\[ D_i^2 = (x - \bar{x}_i)'\, C^{-1}\, (x - \bar{x}_i) \tag{8.3} \]

The observation x is then assigned to the group for which the value of
D_i is smallest.
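
To illustrate why the Mahalanobis measure can lead to a different
assignment than the Euclidean one when variables are correlated,
consider the following minimal Python sketch for two variables. The
group centers, the variance-covariance matrix (correlation of 0.9) and
the observation are all hypothetical values chosen for illustration,
and the 2 x 2 matrix inverse is written out explicitly so that no
external library is needed.

```python
def mahalanobis_sq(x, center, cov):
    """Squared Mahalanobis distance for two variables.

    cov is a 2 x 2 variance-covariance matrix [[a, b], [c, d]] whose
    inverse is (1/det) * [[d, -b], [-c, a]].
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("variance-covariance matrix is singular")
    dx, dy = x[0] - center[0], x[1] - center[1]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def euclidean_sq(x, center):
    """Squared Euclidean distance for two variables."""
    return (x[0] - center[0]) ** 2 + (x[1] - center[1]) ** 2


# Hypothetical group centers and strongly correlated variables
centers = {"A": (0.0, 0.0), "B": (2.0, 0.0)}
cov = [[1.0, 0.9], [0.9, 1.0]]
obs = (1.2, 0.8)

by_mahalanobis = min(centers, key=lambda g: mahalanobis_sq(obs, centers[g], cov))
by_euclidean = min(centers, key=lambda g: euclidean_sq(obs, centers[g]))
print("Mahalanobis assigns to group", by_mahalanobis)
print("Euclidean assigns to group", by_euclidean)
```

With these assumed values the observation is assigned to group A by
the Mahalanobis measure but to group B by the Euclidean measure,
because the displacement from A lies along the direction in which the
two variables tend to vary together.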



8.2.2 Statistical Classification 

As stated earlier, classification methods provide the necessary methodology to: (i) statistically distinguish or "discriminate" between differences in two or more groups when one knows beforehand that such groupings exist in the data set of measurements provided, and (ii) subsequently assign or allocate a future unclassified observation into a specific group with the smallest probability of error.

At the outset, a conceptual understanding of the general classification problem is useful. Consider the simplest classification problem where two groups (Group A and Group B) are to be distinguished based on a single variable x. Both groups have the same variance but different means. If the variable x is plotted on a single axis for both groups, the trivial case is when there is no overlap. In this situation, determining the threshold or boundary or cut-off score is obvious. On the other hand, one could obtain a certain amount of overlap as shown in Fig. 8.1a. The objective of classification in this univariate instance is to determine the cut-off value of x which would yield the fewest errors of false classification. If the two groups have equal variance in the variable x, then the best cut-off value is the point of intersection of the two distributions (as shown in Fig. 8.1a). Note that the areas represented by the tails of the distributions on either side of the cut-off value represent the misclassification probabilities or rates. An obvious extension to this simple problem is when the two distributions are different. There is then no obvious best cut-off value since any choice would result in different misclassification error rates for the two groups (see Fig. 8.1b). In such a case, the choice of the cut-off value is dictated by other issues such as the extent to which misclassification of one group is more critical than that of the other (i.e., errors in one group have more severe adverse implications, in cost for example). The following example illustrates the general approach for the univariate case.

Fig. 8.1 Errors of misclassification for the univariate case. a When the two distributions are similar and equal misclassification rates are sought, the cut-off value or score is selected at the intersection point of the two distributions. b When the two distributions are not similar, a similar cut-off value will result in different misclassification rates. (Both panels plot the group distributions against the measured variable with the cut-off value marked; the tail areas on either side of it correspond to Group B objects misclassified into Group A and Group A objects misclassified into Group B.)


Example 8.2.2: Statistical classification of ordinary and 
energy efficient office buildings 

The objective is to distinguish between medium-sized office buildings which are ordinary (type O) or energy efficient (type E), judged by their "energy use index" (or EUI), i.e., the energy used per square foot per year. Table 8.4 lists the EUIs for 14 buildings in the Phoenix, AZ area, half of which are type O and the other half type E. The first ten values (C1-C10) will be used to train or develop the cut-off score, while the last four will be used for testing, i.e., in order to determine a misclassification rate more representative of future misclassification rates than the one found during training.

This is a simple example meant for conceptual understanding. If the cut-off value is to be determined with no special weighting on misclassification rates, a first attempt is to take it as the mid-point of the two group means. The training data set consists of five values in each category. The average value for the first five buildings (C1-C5 are type O) is 39.7 while that of the second five (C6-C10 are type E) is 36.8. If a cut-off value of 38.2, which is the mid-point or mean of the two averages, is selected, one should expect the EUI for ordinary buildings to be higher than this value and that for energy efficient ones to be lower. The results listed in the last column of Table 8.4 indicate one misclassification in each category during the training phase. Thus, this simple-minded cut-off value is acceptable since it leads to equal misclassification rates among both categories. The table also indicates that among the last four buildings (C11-C14) used for testing the selected cut-off score, one of the ordinary buildings is improperly classified.

Table 8.4 Data table specifying type and associated EUI and the results of the classification analysis (only misclassified ones are indicated)

Building #   Type   EUI (kBtu/ft²/yr)   Data set   Assuming cut-off score of 38.2
C1           O      40.1                Training
C2           O      41.4                Training
C3           O      38.7                Training
C4           O      37.5                Training   Misclassified
C5           O      43.0                Training
C6           E      37.4                Training
C7           E      38.3                Training   Misclassified
C8           E      36.9                Training
C9           E      35.3                Training
C10          E      36.1                Training
C11          O      37.2                Testing    Misclassified
C12          O      39.2                Testing
C13          E      37.2                Testing
C14          E      36.3                Testing

It is left to the reader to extend this type of analysis to the case when the misclassification rates are stipulated as not being equal. A case in point would be to deduce the cut-off value where no misclassification for category O is allowed. A cut-off value of 37.2 would fit this criterion. ■
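
The bookkeeping of this example can be sketched in a few lines. The EUI values and the cut-off score of 38.2 are those quoted in the example; the function names and the convention that an EUI above the cut-off means type O are illustrative choices.

```python
# (building, type, EUI in kBtu/ft2/yr) from Table 8.4
training = [("C1", "O", 40.1), ("C2", "O", 41.4), ("C3", "O", 38.7),
            ("C4", "O", 37.5), ("C5", "O", 43.0), ("C6", "E", 37.4),
            ("C7", "E", 38.3), ("C8", "E", 36.9), ("C9", "E", 35.3),
            ("C10", "E", 36.1)]
testing = [("C11", "O", 37.2), ("C12", "O", 39.2),
           ("C13", "E", 37.2), ("C14", "E", 36.3)]

CUTOFF = 38.2  # cut-off score quoted in the example


def classify(eui, cutoff=CUTOFF):
    """Ordinary buildings are expected to lie above the cut-off, efficient ones below."""
    return "O" if eui > cutoff else "E"


def misclassified(buildings, cutoff=CUTOFF):
    """Names of the buildings whose predicted type differs from the known type."""
    return [name for name, kind, eui in buildings if classify(eui, cutoff) != kind]


print("training:", misclassified(training))
print("testing: ", misclassified(testing))
print("training with cut-off 37.2:", misclassified(training, cutoff=37.2))
```

Lowering the cut-off to 37.2 removes every type O error in the training set, at the price of additional type E errors, which is the trade-off the closing paragraph of the example asks the reader to explore.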

How classification is done using the simple distance measure is conceptually illustrated in Fig. 8.1. The extension of the distance measure to bivariate (and multivariate) cases is intuitively straightforward. Two groups are shown in Fig. 8.2 with normalized variables. During training, the centers of each class can be determined as well as the separating boundaries around them. The two circles (shown continuous) encircle all the points. A future observation will be classified in that group whose class center is closest to the observation. However, a deficiency is that some of the points are misclassified. One solution to decreasing, but not eliminating, the misclassification rates is to reduce the boundaries (shown by dotted circles). However, some of the observations now fall outside the dotted circles, and these points cannot be classified into either Class A or Class B. Hence, the option of reducing the boundary diameters is acceptable only if one is willing to group certain points into a third class called "unable to classify".

Fig. 8.2 Bivariate classification using the distance approach showing the centers and boundaries of the two groups. Data points are assigned to the class whose center ($C_A$ or $C_B$) is closest to the point of interest. Note that reducing the boundaries (from continuous to dotted circles) reduces the misclassification rates but then some of the points fall outside the boundaries, and hence need to be classified into additional groups
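
The distance approach with a reduced boundary can be sketched as follows. The data points and the radius are hypothetical, and the names `centroid` and `classify` are illustrative; the optional `radius` argument plays the role of the dotted circles in Fig. 8.2.

```python
import math


def centroid(points):
    """Center of a group of two-variable observations."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def classify(point, centers, radius=None):
    """Assign the point to the nearest center.

    If a radius is given, a point lying farther than that radius from its
    nearest center is reported as 'unable to classify'.
    """
    name = min(centers, key=lambda g: math.dist(point, centers[g]))
    if radius is not None and math.dist(point, centers[name]) > radius:
        return "unable to classify"
    return name


# Hypothetical normalized training data for the two groups.
group_a = [(-1.2, -0.8), (-0.9, -1.1), (-0.6, -0.7), (-1.3, -1.4)]
group_b = [(1.1, 0.9), (0.8, 1.3), (1.4, 1.2), (0.7, 0.6)]
centers = {"A": centroid(group_a), "B": centroid(group_b)}

print(classify((-0.2, -0.1), centers))              # nearest center only
print(classify((-0.2, -0.1), centers, radius=1.0))  # with a reduced boundary
print(classify((0.9, 1.2), centers, radius=1.0))
```

A point in the overlap region is forced into one class when only the nearest center is used, but is left unclassified once the boundary is tightened.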

8.2.3 Ordinary Least Squares Regression Method

The logical ultimate extension of the univariate two-group situation is the multivariate multi-group case where the classification is based on a number of measurements or variables characterizing different attributes of each group. Instead of a single cut-off value, the multivariate case would require several functions to "separate" the groups.

Regression methods (discussed in Chap. 5) were traditionally developed to deal with model building problems and prediction. They can also be used to model differences between groups and thereby assign a future observation to a particular group. While the response variable during regression is a continuous one, that relevant to classification is a categorical variable representing the class from which the regressor set was gathered. The type of variables best suited for model building is continuous. In the case of classification models, the attributes or regressor variables can be categorical or continuous, but the response variable has to be categorical. Several regression methods have been described in previous chapters, such as least squares linear multivariate models, least squares nonlinear models, maximum likelihood estimation (MLE) and neural network based multi-layer perceptron (MLP). All these methods can also be used in the classification context by simply assigning arbitrary numerical values to the different classes (called coding), which will serve as the dependent variable while training the classification model.

Let us consider the case of distinguishing between two groups A and B based on measurements of two predictor variables $x_1$ and $x_2$ (Fig. 8.3). The groups overlap and the two variables are moderately correlated, somewhat complicating the separation. One can simply use these two measurements as the regressor variables, and arbitrarily assign, say, 0 and 1 to Group A and Group B respectively. Ordinary least squares (OLS) regression will directly yield the necessary model for the boundary between the two sets of data points. This model can be used to predict (or assign or classify) a future set of measurements into either Group A or Group B depending on whether the data point falls above or below the model line respectively. Just like the indices $R^2$ or RMSE are used to evaluate the goodness of fit of a regression model, the accuracy of classification (or the misclassification rate) of all the data points used to identify the regression model can serve as an indicator of the performance of the classification model. A better approach is to adopt the sample hold-out cross-validation approach (described in Sect. 5.3.2) where a portion of the data is used for training, and the rest for testing or evaluating the identified model. The corresponding indices are likely to be more representative of the actual model performance.

Fig. 8.3 Classification involves identifying a separating boundary between two known groups (shown as dots and crosses), based on two attributes (or variables or regressors) $x_1$ and $x_2$, which will minimize the misclassification of the points
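
The coding idea can be sketched with a small least-squares fit. The eight data points below are hypothetical, the groups are coded 0 (A) and 1 (B), and the cut-off of 0.5 on the predicted code is an assumption (the mid-point of the two codes). The normal equations are solved with a hand-written Gaussian elimination so that the sketch needs no external library.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ols(points, codes):
    """Least-squares fit of code = b0 + b1*x1 + b2*x2 through the normal equations."""
    rows = [(1.0, x1, x2) for x1, x2 in points]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, codes)) for i in range(3)]
    return solve(xtx, xty)


def classify(point, coeffs, cutoff=0.5):
    """Group B if the predicted code exceeds the cut-off, else Group A."""
    b0, b1, b2 = coeffs
    return "B" if b0 + b1 * point[0] + b2 * point[1] > cutoff else "A"


# Hypothetical training data: Group A coded 0, Group B coded 1.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (4, 4), (4, 5), (5, 4), (5, 5)]
codes = [0, 0, 0, 0, 1, 1, 1, 1]
coeffs = fit_ols(points, codes)

print([round(c, 4) for c in coeffs])
print(classify((2.5, 2.5), coeffs), classify((3.5, 3.5), coeffs))
```

The line on which the predicted code equals 0.5 acts as the separating boundary; with balanced, well-separated groups such as these, every training point falls on the correct side.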

Figure 8.4a illustrates an example of a linear decision boundary for a set of data points from three known groups or classes. While separation of two classes needed a single linear model, two linear models are needed to separate three classes, and so on. The boundaries need not be linear; piecewise linear models could also be identified (Fig. 8.4b). An approach similar to using indicator variables in OLS regression can also be adopted, which allows the flexibility of capturing local piecewise behavior. An extension of such models to non-linear boundaries in more complex situations leads to multi-layer perceptron artificial neural network (ANN) models described in Sect. 11.3.3. However, it will be shown in the next section that OLS models are not really meant to be used for classification problems though they can provide good results in some cases. A better modeling approach is described in the next section.

Fig. 8.4 Linear and piecewise linear decision boundaries for two-dimensional data (color intensity and alcohol content) used to classify the type of wine into one of three classes. a Linear decision boundaries. b Piecewise linear decision boundaries. (From Hair et al. 1998 by permission of Pearson Education)



8.2.4 Discriminant Function Analysis 

Linear discriminant function analysis (LDA), originally proposed by Fisher in 1936, is similar to multiple linear regression analysis but approaches the problem differently. The similarity lies in that both approaches are based on identifying a linear model from a set of p observed quantitative variables $x_j$, such as $z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_p x_p$, where $z$ is called the discriminant score and the $w_j$ are the model weights. A model such as this allows multivariate observations $\mathbf{x}_i$ to be converted into univariate observations $z_i$. However, the determination of the weights, which are similar to the model parameters during regression, is done differently. LDA seeks to determine weights that maximize the ratio of between-class scatter to within-class scatter, or

$$\max \; \frac{\text{squared distance between means of } z}{\text{variance of } z} = \max \; \frac{(\mu_1 - \mu_2)^2}{\sigma_z^2} \tag{8.4}$$

where $\mu_1$ and $\mu_2$ are the average values of $z_i$ for Group A and Group B respectively, and $\sigma_z^2$ is the variance of $z$ within any one group, with the assumption that the variances of both groups are equal.

Just like in OLS regression, the loss function need not necessarily be one that penalizes squared differences, but this is a form which is widely adopted. This approach allows two or more groups of data to be distinguished as well as possible. More weight is given to those regressors that discriminate well and vice versa. The method is optimal when the two classes are normally distributed with equal covariance matrices; even when they are not, the method is said to give satisfactory results.

The model, once identified, can be used for discrimination, i.e., to classify new observations as belonging to one or another group (in case of two groups only). This is done by determining the threshold or separating score, with new objects having scores larger than this score being assigned to one class and those with lower scores assigned to the other class. If $\bar{z}_A$ and $\bar{z}_B$ are the mean discriminant scores of preclassified samples from groups A and B, the optimal choice for the threshold score $z_{thres}$, when the two classes are of equal size, are distributed with similar variance, and equal misclassification rates are sought, is $z_{thres} = (\bar{z}_A + \bar{z}_B)/2$. A new sample will be classified into one group or the other depending on whether its score $z$ is larger or smaller than $z_{thres}$. Misclassification errors pertaining to one class can be reduced by appropriate weighting if the resulting consequences are more severe in one group as compared to the other.
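
A minimal two-group, two-variable sketch of these steps follows. It uses the standard closed-form result for Fisher's criterion, in which the weights are proportional to the inverse of the pooled within-group scatter matrix times the difference of the group means; this formula is not stated in the text above and is brought in here as an assumption of the sketch, as are the hypothetical data and the omission of the intercept $w_0$ (which only shifts all scores and the threshold by the same amount).

```python
def mean(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]


def scatter(points, m):
    """2x2 within-group scatter matrix (sum of outer products of deviations)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit_lda(group_a, group_b):
    """Return the weights (w1, w2) and the mid-point threshold score."""
    m_a, m_b = mean(group_a), mean(group_b)
    s_a, s_b = scatter(group_a, m_a), scatter(group_b, m_b)
    sw = [[s_a[i][j] + s_b[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [m_b[0] - m_a[0], m_b[1] - m_a[1]]
    # w = Sw^{-1} (m_b - m_a), with the 2x2 inverse written out explicitly
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]
    z_a = w[0] * m_a[0] + w[1] * m_a[1]   # mean score of group A
    z_b = w[0] * m_b[0] + w[1] * m_b[1]   # mean score of group B
    return w, (z_a + z_b) / 2


def classify(point, w, threshold):
    z = w[0] * point[0] + w[1] * point[1]
    return "B" if z > threshold else "A"


# Hypothetical preclassified samples of equal size.
group_a = [(1, 1), (1, 2), (2, 1), (2, 2)]
group_b = [(4, 4), (4, 5), (5, 4), (5, 5)]
w, z_thres = fit_lda(group_a, group_b)

print(w, z_thres)
print(classify((2.5, 2.5), w, z_thres), classify((3.5, 3.5), w, z_thres))
```

Because the two groups are of equal size and have the same scatter, the threshold is simply the mid-point of the two mean scores, as in the expression for $z_{thres}$ above.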

Several studies use standardized regressors (with zero mean and unit variance, as usually done in principal component analysis or PCA) to identify the discriminant function. Others argue that an advantage of LDA is that data need not be standardized since results are not affected by scaling of the individual variables. One difference is that, when using standardized variables, the discriminant function would not need an intercept term, while this is needed when using untransformed variables.

In constructing the discriminant functions under a multivariate case, one can include all of the regressor variables or adopt a stepwise selection procedure that includes only those variables that are statistically significant discriminators amongst the groups. A number of statistical software packages are available which perform such stepwise procedures and provide useful summaries and tests of significance for the number of discriminant functions. If the dependent variable is dichotomous (i.e., can only assume two values), there is only one discriminant function. If there are k levels or categories, up to (k-1) functions can be extracted. Just like in PCA (Sect. 10.3.2), successive discriminant functions are orthogonal to one another and one can test or determine how many are worth extracting. The interested reader can refer to pertinent texts such as Manly (2005), Hand (1981) or Duda et al. (2001) for a mathematical treatment of LDA, for establishing statistical significance of group differences, and for more robust methods such as quadratic discriminant analysis.

Though LDA is widely used for classification problems, it is increasingly being replaced by logistic regression (see Sect. 10.4.4) since the latter makes fewer assumptions and hence is more flexible (for example, discriminant analysis is based on normally distributed variables), and is more robust statistically when dealing with actual data. Logistic regression is also said to be more parsimonious and the values of the weights easier to interpret. A drawback (if it is one!) is that logistic regression requires model weights to be estimated by the maximum likelihood method (Sect. 10.4.3), which some analysts are not as familiar with as OLS regression.

Example 8.2.3:¹ Using discriminant analysis to model fault-free and faulty behavior of chillers

This example illustrates the use of multiple linear regression and linear discriminant analysis to classify two data sets representative of normal and faulty operation of a large centrifugal chiller. The faulty operation corresponds to chiller performance in which a non-condensable gas (namely nitrogen) was intentionally introduced in the refrigerant loop. The three discerning variables are: T_cd-sub, the amount of refrigerant subcooling in the condenser (°C); T_cd-app, the condenser approach temperature (°C), i.e., the difference between the condenser refrigerant temperature and that of the cooling water leaving the condenser; and COP, the coefficient of performance of the chiller. The data are assembled in Table 8.5, which shows the assigned grouping code (0 for fault-free and 1 for faulty) and the numerical values of the three regressors. The two data sets consist of 27 operating points each; 21 points from each set are used for training the model while the remaining points are used for evaluating the classification model.

The 3-D scatter plot of the three variables used for model training is shown in Fig. 8.5 with the two groups distinguished by different symbols. One notes that the two groups are fairly distinct and that no misclassification should occur (in short, not a very challenging problem).

The following models were identified (with all coefficients being statistically significant):

• OLS regression model: z = 1.03467 - 0.33752 * COP - 0.3446 * T_cd-sub + 0.74493 * T_cd-app

• LDA model: z = 2.1803 - 2.51329 * COP - 1.63738 * T_cd-sub + 4.19767 * T_cd-app



¹ From Reddy (2007), with data provided by James Braun, for which we are grateful.
Table 8.5 Analysis results for chiller fault detection example using two different methods, namely the OLS regression method and the linear discriminant analysis (LDA) method. Fault-free data is assigned a class value of 0 while faulty data a value of 1. The models are identified from training data and then used to predict class membership for testing data. The cut-off score is 0.5 for both approaches, i.e., if the calculated score is less than 0.5, then the observation is deemed to belong to the fault-free behavior, and vice versa. LDA model scores also fall on either side of 0.5 but are magnified as compared to the OLS scores, leading to more robust classification. In this example, there are no misclassified data points for either model during both training and testing periods

Data set   Class   COP     T_cd-sub (°C)   T_cd-app (°C)   OLS model score   LDA model score
Training   0       3.765   4.911           2.319           -0.08             -5.59
Training   0       3.405   3.778           1.822           -0.01             -4.91
Training   0       2.425   2.611           1.009            0.09             -3.95
Training   0       4.512   5.800           3.376            0.04             -4.49
Training   0       4.748   4.589           2.752           -0.09             -5.71
Training   0       4.513   3.356           1.892           -0.19             -6.71
Training   0       3.503   2.244           1.272           -0.01             -4.96
Training   0       3.593   4.878           2.706            0.14             -3.48
Training   0       3.252   3.700           1.720            0.00             -4.83
Training   0       2.463   2.578           1.102            0.13             -3.60
Training   0       4.274   5.422           3.323            0.14             -3.49
Training   0       4.684   4.989           3.140            0.03             -4.58
Training   0       4.641   3.589           2.188           -0.14             -6.18
Training   0       3.038   1.989           1.061            0.06             -4.26
Training   0       3.763   4.656           2.687            0.13             -3.62
Training   0       3.342   3.456           1.926            0.11             -3.79
Training   0       2.526   2.600           1.108            0.11             -3.77
Training   0       4.411   5.411           3.383            0.13             -3.56
Training   0       4.029   3.844           2.128           -0.05             -5.31
Training   0       4.443   3.556           2.121           -0.11             -5.91
Training   0       3.151   2.333           1.224            0.04             -4.42
Training   1       3.587   6.656           4.497            0.62              1.14
Training   1       3.198   5.767           3.881            0.60              0.99
Training   1       2.416   4.333           2.793            0.58              0.74
Training   1       2.414   3.811           2.722            0.63              1.30
Training   1       4.525   7.256           5.359            0.65              1.42
Training   1       4.232   6.022           4.557            0.58              0.81
Training   1       3.424   4.544           3.538            0.60              0.99
Training   1       3.382   6.533           4.602            0.74              2.30
Training   1       3.017   5.667           3.907            0.68              1.72
Training   1       3.730   5.933           4.372            0.65              1.44
Training   1       2.395   3.989           2.884            0.68              1.74
Training   1       4.460   7.356           5.567            0.74              2.29
Training   1       4.166   6.044           4.697            0.66              1.53
Training   1       2.974   4.456           3.339            0.65              1.43
Training   1       3.568   7.033           4.710            0.65              1.47
Training   1       3.162   6.044           4.070            0.65              1.42
Training   1       2.382   4.544           3.116            0.69              1.83
Training   1       4.263   7.567           5.672            0.80              2.89
Training   1       3.757   5.967           4.368            0.63              1.30
Training   1       4.132   6.589           4.793            0.62              1.13
Training   1       2.944   4.811           3.470            0.65              1.47
Testing    0       3.947   3.567           1.914           -0.07             -5.54
Testing    0       2.434   1.967           0.873            0.14             -3.49
Testing    0       3.678   3.389           1.907            0.02             -4.61
Testing    0       2.517   2.133           1.039            0.16             -3.28
Testing    0       2.815   2.122           0.946            0.04             -4.40
Testing    0       4.785   5.100           3.052           -0.06             -5.38
Testing    1       4.330   7.656           5.513            0.70              1.91
Testing    1       3.716   5.633           4.082            0.58              0.75
Testing    1       2.309   4.489           2.954            0.65              1.43
Testing    1       4.059   7.467           5.501            0.79              2.84
Testing    1       2.615   4.511           3.239            0.69              1.82
Testing    1       4.539   7.667           5.550            0.66              1.52



Fig. 8.5 Scatterplot of the three 
variables. Fault-free data (coded 0) 
is shown as diamonds while faulty 
data (coded 1) is shown as crosses. 
Clearly there is no overlap between 
the data sets and this is supported 
by the analysis which indicates no 
misclassified data points 
The corresponding scores for both OLS and LDA are shown in the last two columns of the table. The threshold score $z_{thres}$ is simply 0.5, which is the average of the two group codes (0 and 1). A calculated z score less than 0.5 would suggest that the data point came from fault-free operation, while a score greater than 0.5 would suggest faulty operation. Note that there are no misclassified data points during either training or testing periods for either approach. However, note that for OLS there are several instances when the score is very close to the threshold value of 0.5. Figure 8.6 shows the predicted (using the OLS model) versus the "measured" values of the two groups, which can only assume values of either 0 or 1. This figure clearly indicates the poor modeling capability of OLS, suggestive of the fact that OLS, though used by some for classification problems, is not really meant for this purpose. ■
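
As a hedged illustration of how the identified model is applied, the sketch below evaluates the LDA model quoted in the example for three operating points taken from Table 8.5 and compares the score against the stated cut-off of 0.5. The function and variable names are illustrative only.

```python
# Coefficients of the LDA model quoted in Example 8.2.3
INTERCEPT = 2.1803
W_COP = -2.51329
W_T_CD_SUB = -1.63738
W_T_CD_APP = 4.19767
CUTOFF = 0.5  # cut-off score stated in the example


def lda_score(cop, t_cd_sub, t_cd_app):
    """Discriminant score z for one operating point."""
    return INTERCEPT + W_COP * cop + W_T_CD_SUB * t_cd_sub + W_T_CD_APP * t_cd_app


def classify(cop, t_cd_sub, t_cd_app):
    return "faulty" if lda_score(cop, t_cd_sub, t_cd_app) > CUTOFF else "fault-free"


# (COP, T_cd-sub, T_cd-app) for three rows of Table 8.5
samples = [
    (3.765, 4.911, 2.319),  # first fault-free training point
    (3.587, 6.656, 4.497),  # first faulty training point
    (4.330, 7.656, 5.513),  # first faulty testing point
]

for s in samples:
    print(s, round(lda_score(*s), 2), classify(*s))
```

The computed scores agree with the LDA column of Table 8.5 to within rounding of the tabulated inputs.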



8.2.5 Bayesian Classification 

The Bayesian approach was addressed in Sect. 2.5 and also in Sect. 4.6. Bayesian statistics provide the formal manner by which prior opinion expressed as probabilities can be revised in the light of new information (from additional data collected) to yield posterior probabilities. The general approach can be recast into a framework which also allows classification tasks to be performed. The simplified or Naive Bayes method assumes that the predictors are statistically independent and uses prior probabilities for training the model, which subsequently can be used along with the likelihood function of a new sample to classify the sample into the most likely group. The training and classification are easily interpreted. It is said to be most appropriate when the number of predictors is very high. Further, it is easy to use, handles missing data well, and requires little computational effort. However, it does not handle continuous data well, and does not always yield satisfactory results because of its inherent assumption of predictor independence, which is very often not the case. Despite these limitations, it is a useful analysis approach to have in one's toolkit.

Fig. 8.6 Though an OLS model can be used for classification, it is not really meant for this purpose. This is illustrated by the poor correspondence between observed vs predicted values of the coded "class" variables. The observed values can assume numerical values of either 0 or 1 only, while the values predicted by the model range from -0.2 to about 1.2. See Example 8.2.3

Table 8.6 Count data of the 200 samples collected and calculated probabilities of attributes (Example 8.2.4)

                          Number of samples          Calculated probabilities of attributes
Attribute                 Poor   Average   Good      Poor    Average   Good
Type       Bituminous     72     20        8         0.72    0.20      0.08
           Anthracite     0      44        56        0.0     0.44      0.56
Carbon %   50-60%         41     -         -         41/72   -         -
           60-70%         31     42        -         31/72   42/64     -
           70-80%         -      22        28        -       22/64     28/64
           80-90%         -      -         36        -       -         36/64
Total                     72     64        64        -       -         -


Example 8.2.4: Bayesian classification of coal sample based on carbon content

There are several types of coal used in thermal power plants to produce electricity. A power plant gets two types of coal: bituminous and anthracite. Each of these two types can contain different fixed carbon content depending on the time and the location from where the sample was mined. Further, each of these types of coal can be assigned to one of three categories: Poor, Average and Good, depending on the carbon content, whose thresholds are different for the two types of coal. For example, a bituminous sample can be graded as "good" while an anthracite sample can be graded as "average" even though both samples may have the same carbon content. Table 8.6 shows the prior data (of 200 samples) and the associated probabilities of the attributes. This corresponds to the training data set.

These values are used to determine the prior probabilities as shown in the second column of Table 8.7. The power plant operator wants to classify a new sample of bituminous coal which is found to contain 70-80% carbon content. The sample probabilities are shown in the third column of Table 8.7. The values in the likelihood column add up to 0.0332, which is used to determine the actual posterior probabilities shown in the last column. The category which has the highest posterior probability can then be identified. Thus, the new sample will be classified as "average". This is a simple contrived example meant to illustrate the concept and to show the various calculation steps, which are straightforward to interpret by those with a basic understanding of Bayesian statistics. ■

Table 8.7 Calculation of the prior, sample and posterior probabilities

              Prior probabilities           Sample probabilities                Likelihood              Posterior probabilities
Poor (p)      p(p) = (41+31)/200 = 0.36     p(s/p) = 0.72 × 0 = 0               0.36 × 0 = 0            0/0.0332 = 0
Average (a)   p(a) = (42+22)/200 = 0.32     p(s/a) = 0.20 × (22/64) = 0.0687    0.32 × 0.0687 = 0.022   0.022/0.0332 = 0.663
Good (g)      p(g) = (28+36)/200 = 0.32     p(s/g) = 0.08 × (28/64) = 0.035     0.32 × 0.035 = 0.0112   0.0112/0.0332 = 0.337
                                                                                Sum = 0.0332
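
The arithmetic of Table 8.7 can be retraced with the short sketch below. It takes the attribute probabilities exactly as tabulated in Table 8.6 and multiplies them under the Naive Bayes independence assumption; the dictionary names are illustrative only.

```python
# Number of training samples in each category (last row of Table 8.6)
counts = {"poor": 72, "average": 64, "good": 64}
total = sum(counts.values())
prior = {g: n / total for g, n in counts.items()}

# Attribute probabilities as tabulated in Table 8.6 for the new sample:
# bituminous coal with 70-80% carbon content.
p_bituminous = {"poor": 0.72, "average": 0.20, "good": 0.08}
p_carbon_70_80 = {"poor": 0.0, "average": 22 / 64, "good": 28 / 64}

# Sample probability: product of the attribute probabilities (independence assumed).
sample = {g: p_bituminous[g] * p_carbon_70_80[g] for g in prior}

# Likelihood column: prior times sample probability, then normalize.
likelihood = {g: prior[g] * sample[g] for g in prior}
evidence = sum(likelihood.values())
posterior = {g: likelihood[g] / evidence for g in prior}

best = max(posterior, key=posterior.get)
for g in prior:
    print(f"{g:8s} prior={prior[g]:.2f} likelihood={likelihood[g]:.4f} "
          f"posterior={posterior[g]:.3f}")
print("classified as:", best)
```

The normalizing sum and the posterior probabilities agree with the last two columns of Table 8.7 to within rounding.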



8.3 Heuristic Classification Methods

8.3.1 Rule-Based Methods 

The simplest type of rule-based method is the one involving "if-then" rules. Such classification rules consist of the "if" or antecedent part, and the "then" or consequent part of the rule (Dunham 2006). These rules must cover all the possibilities, and every instance must be uniquely assigned to a particular group. Such a heuristic approach is widely used in several fields because of the ease of interpretation and implementation of the algorithm. The following example illustrates this approach.

Example 8.3.1:² Rule-based admission policy into the Yale medical school

The selection committee framed the following set of rules for interviewing applicants into the school based on undergraduate (UG) GPA and MCAT verbal (V) and MCAT quantitative (Q) scores:

• If UG GPA < 3.47 and MCAT-V < 555, then Group A, reject

• If UG GPA < 3.47 and MCAT-V > 555 and MCAT-Q < 655, then Group B, reject

• If UG GPA < 3.47 and MCAT-V > 555 and MCAT-Q > 655, then Group C, interview

• If UG GPA > 3.47 and MCAT-V < 535, then Group D, reject

• If UG GPA > 3.47 and MCAT-V > 535, then Group E, interview.

It is clear that the set of rules is comprehensive and would cover every eventuality. For example, an applicant with UG GPA = 3.6 and MCAT-V = 525 would fall under Group D and be rejected without an interview. Thus, the pre-determined thresholds or selection criteria on GPA, MCAT-V and MCAT-Q are in essence the classification model, while classification of a future applicant is straightforward. ■
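
The five rules translate directly into nested if-then statements, as in the sketch below. The function name is illustrative, and the treatment of scores exactly equal to a threshold (sent to the upper branch) is an assumption, since the rules as stated use strict inequalities only.

```python
def classify_applicant(gpa, mcat_v, mcat_q):
    """Return (group, decision) following the if-then rules of Example 8.3.1.

    Assumption: a value exactly equal to a threshold follows the '>' branch.
    """
    if gpa < 3.47:
        if mcat_v < 555:
            return "A", "reject"
        if mcat_q < 655:
            return "B", "reject"
        return "C", "interview"
    if mcat_v < 535:
        return "D", "reject"
    return "E", "interview"


# The applicant discussed in the example (MCAT-Q is not needed on this branch).
print(classify_applicant(gpa=3.6, mcat_v=525, mcat_q=600))
```

Note that the MCAT-Q score is only consulted for applicants with a low GPA and a high MCAT-V score, exactly as in the rule set.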



8.3.2 Decision Trees 

Probability trees were introduced in Sect. 2.2.4 as a means of dividing a decision problem into a hierarchical structure for easier understanding and analysis. Very similar in concept are directed graphs or decision trees, which are predictive modeling approaches that can be used for classification, clustering as well as for regression model building. As stated earlier, classification problems differ from regression problems in that the response variable is categorical in the former, and continuous in the latter. Treed regression is addressed in Sect. 8.4, while this section limits itself to classification problems. Decision trees essentially partition the variable space such that each branch can be associated with a different sub-region. A rule is associated with each node of the tree, and observations which satisfy the rule are assigned to the corresponding branch of the tree. Terminal nodes are the end-nodes of the tree. Though similar to if-then rules in their structure, decision trees are easier to comprehend in more complex situations, and are more efficient computationally.

Example 8.3.2: Consider the same problem as that of Example 8.3.1. These if-then rules can be represented by a tree diagram as shown in Fig. 8.7. The top node contains the entire data set, while each node further down contains a subset of data till the branches end in a unique branch representing one of the five possible groups (Groups A-E). In many ways, this diagram is easier to comprehend than the if-then rules, and further allows intuitive tweaking of the rules as necessary. The numbers in parentheses shown in Fig. 8.7 represent the numbers of applicants at different stages of the tree (out of a total of 727). Note that 481 applicants made the cut (sum of those who ended up in Group C or Group E) and would be called for an interview. If the school decides to cut this number down in future years, it can use the set of 727 applicants, and iteratively evaluate the effect of modifying the rules at different levels till the intended target reduction is achieved. This can be programmed into a computer to facilitate the search for an optimum set of rules. A simpler and more intuitive manner is to study Fig. 8.7, and evaluate certain heuristic modifications. For example, more successful applicants fall in Group E than Group C, and so increasing the MCAT-V score threshold from 535 to something a little higher may be an option worth evaluating.



[Tree diagram. Node labels recoverable from the scan: GPA (727) at
the root; Reject (Group A); Interview (Group E); (122);
Reject (127); Interview (Group C); (Group B).]



From Milstein et al. (1975) 



Fig. 8.7 Tree diagram for the medical school admission process with 
five terminal nodes, each representing a different group (Example 
8.3.2). This is a binary tree with three levels 

8.3.3 k Nearest Neighbors

The k nearest neighbor (kNN) method is a conceptually 
simple pattern recognition approach that is widely used for 
classification. It is based on the distance measure, and requi- 
res a training data set of observations from different groups 
identified as such. If a future object needs to be classified, 
one determines the point closest to this new object, and sim- 
ply assigns the new object to the group to which the closest 
point belongs. Thus, no training as such is needed. The clas- 
sification is more robust if a few points are used rather than a 
single closest neighbor. This, however, leads to the following 
issues which complicate the classification: 
(i) how many closest points "k" should be used for the clas- 
sification, and 
(ii) how to reconcile differences when the nearest neighbors 
come from different groups. 

Because of different ways by which the above issues can 
be addressed, kNN is more of an algorithm than a clear-cut 
analytical procedure. An allied classification method is the 
closest neighborhood scheme, where an object is classified 
in that group for which its distance from the center of that 
group happens to be the smallest as compared to its distances 
from the centers of other possible groups. Training would 
involve computing the centers of each group and distances 
of individual objects from this center. 
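
The two schemes described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the Euclidean distance is used, issue (ii) is resolved by a simple majority vote (with ties going to the group of the nearer point), and the function names and the six sample points are hypothetical, not taken from the text.

```python
from collections import Counter
import math

def knn_classify(train_points, train_labels, query, k=3):
    """Assign `query` to the majority group among its k nearest training points."""
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def nearest_center_classify(train_points, train_labels, query):
    """Assign `query` to the group whose center (mean point) is closest."""
    groups = {}
    for p, label in zip(train_points, train_labels):
        groups.setdefault(label, []).append(p)
    # "Training" here is just computing one center per group.
    centers = {
        label: tuple(sum(c) / len(pts) for c in zip(*pts))
        for label, pts in groups.items()
    }
    return min(centers, key=lambda label: math.dist(centers[label], query))

# Hypothetical two-group training set.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(points, labels, (1.1, 1.0), k=3))
print(nearest_center_classify(points, labels, (3.0, 3.0)))
```

A different distance measure or voting rule (for example, weighting votes by inverse distance) could be swapped in without changing the overall structure.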

A redeeming feature of kNN is that it does not impose a 
priori any assumptions about the distribution from which the 
modeling sample is drawn. Stated differently, kNN has the 
great advantage that it is asymptotically convergent, i.e., as 
the size of the training set increases, misclassification errors 
will be minimized if the observations are independent re- 
gardless of the distribution from which the sample is drawn. 
kNN can be adapted to a wide range of applications, with the 
distance measure modified to suit the particular application. 
The following example illustrates one such application. 

Example 8.3.3: Using k nearest neighborhood to calculate 
uncertainty in building energy savings 

This example illustrates how the nearest neighbor appro- 
ach can be used to estimate the uncertainty in building ener- 
gy savings after energy conservation measures (ECM) have 
been installed (adapted from Subbarao et al. 2011). Exam- 
ple 5.7.1 and Problem Pr. 5.12 describe the analysis metho- 
dology which consists of four steps: 

(i) identify a baseline multivariate regression model for
energy use against climatic and operating variables before
the retrofits were implemented,
(ii) use this baseline model along with post-retrofit climatic
and operating variables data to predict energy use
$E_{pre,model}$ reflective of consumption during the
pre-retrofit stage,
(iii) compute energy savings as the difference between the
model predicted baseline energy use and the actual
measured energy use during the post-retrofit period,
$E_{savings} = E_{pre,model} - E_{post,meas}$, and
(iv) determine the uncertainty in the energy savings based
on the multivariate baseline model goodness-of-fit
(such as the RMSE) and the uncertainty in the post-retrofit
measured energy use.
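
Steps (i) to (iii) can be sketched as follows. This is an illustrative sketch only: the data are synthetic, a single climatic regressor and a straight-line baseline model are assumed for brevity, and the helper `fit_line` and all numerical values are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

random.seed(0)

# Synthetic pre-retrofit days: energy use driven by one climatic variable.
temp_pre = [random.uniform(50, 90) for _ in range(200)]
energy_pre = [20.0 + 2.0 * t + random.gauss(0, 3.0) for t in temp_pre]

# Step (i): identify the baseline model from pre-retrofit data.
a, b = fit_line(temp_pre, energy_pre)

# Synthetic post-retrofit days: same climate dependence, 15% lower use.
temp_post = [random.uniform(50, 90) for _ in range(100)]
energy_post_meas = [0.85 * (20.0 + 2.0 * t) + random.gauss(0, 3.0)
                    for t in temp_post]

# Step (ii): baseline prediction under post-retrofit conditions.
energy_pre_model = [a + b * t for t in temp_post]

# Step (iii): savings = predicted baseline use - measured post-retrofit use.
savings = [p - m for p, m in zip(energy_pre_model, energy_post_meas)]
mean_savings = sum(savings) / len(savings)
print(round(mean_savings, 1))
```

Step (iv), the uncertainty of these savings, is the part that the nearest neighbor approach of this example addresses.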

Unfortunately, step (iv) is not straightforward. Energy use
models identified from global or year-long data do not adequately
capture seasonal changes in energy use due to control and
operation changes made to the various building systems, since
these variables do not appear explicitly as regressor variables.
Hence, classical models identified from whole-year data are
handicapped in this respect, and this often leads to model
residuals that have different patterns during different times of
the year. Such improper residual behavior results in improper
estimates of the uncertainty surrounding the measured savings. An
alternative is to use the nearest neighbors approach which relies
on "local" model behavior as against global estimates such as the
overall RMSE. However, the k-nearest neighbor approach requires
two aspects to be defined specific to the problem at hand:
(i) defining the distance measure, and (ii) deciding on the number
of neighbor points to select.

Let us assume that a statistical model with p regressor
parameters has been identified from the pre-retrofit period based
on daily variables. Any day, for specificity a pre-retrofit day j,
can be represented as a point in this p-dimensional space. If data
for a whole year is available, the days are represented by 365
points in this p-dimensional space. The uncertainty in the model
estimate for a post-retrofit day j is better characterized by
identifying a certain number of days in the pre-retrofit period
which closely match the specific values of the regressor set for
that post-retrofit day, and then determining the error
distribution from this set of days. Thus, the method is applicable
regardless of the type of model residual behavior encountered.

Not all regressors have the same effect on the response
variable; hence, those that are more influential need to be
weighted more, and vice versa. The distance $d_{ij}$ between two
given days i and j, specified by the sets of regressor variables
$x_{k,i}$ and $x_{k,j}$, is defined as:

$$d_{ij} = \left[ \sum_{k=1}^{p} w_k^2 \, (x_{k,i} - x_{k,j})^2 / p \right]^{1/2}$$    (8.5)

where the weights $w_k$ are given in terms of the derivative of
energy use with respect to the regressors:

$$w_k = \frac{\partial E_{pre,model}}{\partial x_k}$$    (8.6)



Fig. 8.8 Illustration of the neighborhood concept for a baseline
with two regressors (dry-bulb temperature DBT and dew point
temperature DPT) for Example 8.3.3. If the DBT variable has more
"weight" than DPT on the variation of the response variable, this
would translate geometrically into an elliptic domain as shown.
The data set of "neighbor points" to the post datum point (75, 60)
would consist of all points contained within the ellipse. Further,
a given point within this ellipse may be assigned more "influence"
the closer it is to the center of the ellipse. (From Subbarao et
al. 2011)

The partial derivatives can be determined numerically by
perturbation, as discussed in Sect. 3.7.1. Days that are at a
given "energy distance" from a given day lie on an ellipsoid
whose axis in the k-direction is proportional to $(1/w_k)$.
This concept is illustrated in Fig. 8.8.
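
A minimal sketch of the weighted "energy distance" follows. It assumes the form in which each regressor difference is scaled by its weight, squared, averaged over the p regressors, and square-rooted; the function name is hypothetical. The weights, the reference point and the two neighbor days are the values quoted in this example (gradients 5.0685 and 7.606, reference DBT=75°F and ResDPT=5°F, and the first two rows of Table 8.8).

```python
import math

def energy_distance(x_i, x_j, weights):
    """Weighted distance between two days in regressor space.

    Assumed form: d_ij = sqrt( sum_k [w_k * (x_ki - x_kj)]**2 / p ).
    """
    p = len(weights)
    total = sum((w * (a - b)) ** 2 for w, a, b in zip(weights, x_i, x_j))
    return math.sqrt(total / p)

weights = (5.0685, 7.606)   # gradients w.r.t. DBT and ResDPT from the example
reference = (75.0, 5.0)     # DBT = 75 F, ResDPT = 5 F

d1 = energy_distance((74.67, 4.76), reference, weights)
d2 = energy_distance((75.42, 4.22), reference, weights)
print(round(d1, 2), round(d2, 2))
```

With the rounded regressor values printed in the table, the computed distances come out close to, though not exactly equal to, the tabulated 1.78 and 4.47.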

The selection of the number of neighbor points is somewhat
arbitrary and can be done either by deciding on a maximum
distance, or by selecting the number of points based on the
confidence level sought; the latter approach has been adopted
below. One can associate an ellipsoid with each post-retrofit day
in the parameter space; pre-retrofit days that lie inside this
ellipsoid contribute to the determination of uncertainty in the
estimation of the savings for this particular post-retrofit day.
The distribution of uncertainties in the estimate
$E_{pre,model,j}$ is given by the distribution of the estimates of
$(E_{pre,model,j} - E_{post,meas,j})$ inside the ellipsoid. The
overall size of the ellipsoid is determined by the requirements of
making it as small as possible (so that variations in the daily
energy use are small) while having a sufficient number of
pre-retrofit days within the ellipsoid.

The proposed approach is illustrated with a simple example
involving synthetic daily data of building energy use. Though the
simulation involves numerous variables, only the following
variables are considered as they relate to a subsequent
statistical model: (i) regressors: ambient air dry-bulb
temperature (DBT) and dew point temperature (DPT), and
(ii) response: cooling coil thermal load ($Q_c$). The hourly data
has first been separated into weekdays and weekends and then
averaged/summed to represent daily values. Only the weekday data
set consisting of 249 values has been used to illustrate the
concept of the proposed methodology. Figure 8.9 is a scatter plot
of cooling load $Q_c$ versus DBT. This is a typical plot showing
strong change point non-linear behavior and a large cloud at the
high DBT range due to humidity loads. It would be very difficult
to model this behavior using traditional regression methods that
would yield realistic uncertainty estimates at specific local
ranges.

Fig. 8.9 Scatter plot of thermal cooling load $Q_c$ versus DBT
for Example 8.3.3

Previous studies have demonstrated that only when the outdoor
air dew point temperature is higher than about 55 °F (which is
usually close to the cold air deck temperature of the air handler
unit) do humidity loads appear on the cooling coil due to the
ventilation air brought into the building. Hence, the variable
(DPT-55)* rather than DPT is used as the regressor variable in the
model, which is such that: (DPT-55)* = 0 when DPT < 55 °F, and
(DPT-55)* = (DPT-55) when DPT > 55 °F.

A more critical issue with multivariate models in general 
is the collinear behavior between regressors; in the case of 
building energy use models, the most critical is the signifi- 
cant correlation between DBT and DPT. If one ignores this, 
then the energy use model may have physically unreasona- 
ble internal parameter values, but continue to give reaso- 
nable predictions. However, the derivatives of the response 
variable with respect to the regressor variables can be very 
misleading. One variable may "steal" the dependence from 
another variable, which will affect the weights assigned to 
the different regressors. To mitigate this problem, a model of 
(DPT-55°F)* vs DBT is identified, and the residuals of this 
model (ResDPT) are used instead of DPT in the regressor 
set for energy use modeling. Though this procedure assigns 
more influence to DBT, the collinearity effect is reduced. 
The model could be a simple linear model or could be an ar- 
tificial neural network (ANN) multi-layer perceptron model, 
depending on the preference of the analyst. Consider the case
when one wishes to determine the uncertainty in the response
variable corresponding to a set of operating conditions
specified by DBT=75°F and ResDPT=5°F which results in
$Q_c$ = 233.88 MBtu/day. The ANN 2-10-1 model was used to
numerically determine the gradients with respect to these two
regressors:

$$\frac{\partial Q_c}{\partial (DBT)} = 5.0685 \quad \text{and} \quad \frac{\partial Q_c}{\partial (ResDPT)} = 7.606$$
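
The preprocessing described above (the change-point variable, the residuals of a model of (DPT-55)* against DBT, and gradients obtained by perturbation) can be sketched as follows. This is a hedged illustration: a straight-line model is assumed for the DPT-versus-DBT step, central differences are assumed for the perturbation, and `toy_model` is a hypothetical stand-in for the fitted energy model, not the ANN used in the example. All data values are made up.

```python
def dpt_excess(dpt, threshold=55.0):
    """(DPT-55)*: zero at or below the threshold, DPT - threshold above it."""
    return max(dpt - threshold, 0.0)

def fit_line(x, y):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def residualize(dbt, dpt_star):
    """Residuals (ResDPT) of a linear model of (DPT-55)* against DBT."""
    a, b = fit_line(dbt, dpt_star)
    return [d - (a + b * t) for t, d in zip(dbt, dpt_star)]

def gradient(model, point, rel_step=1e-4):
    """Partial derivatives of model(point) by central-difference perturbation."""
    grads = []
    for k, xk in enumerate(point):
        h = rel_step * max(abs(xk), 1.0)
        up, dn = list(point), list(point)
        up[k], dn[k] = xk + h, xk - h
        grads.append((model(up) - model(dn)) / (2.0 * h))
    return grads

def toy_model(x):
    """Hypothetical energy model of (DBT, ResDPT); stands in for the ANN."""
    dbt, res_dpt = x
    return 5.0 * dbt + 7.5 * res_dpt + 0.01 * dbt * res_dpt

dbt = [60.0, 70.0, 80.0, 90.0]
dpt = [50.0, 58.0, 62.0, 70.0]
res_dpt = residualize(dbt, [dpt_excess(d) for d in dpt])
grads = gradient(toy_model, [75.0, 5.0])
print(res_dpt, grads)
```

By construction the least-squares residuals are uncorrelated with DBT, which is what reduces the collinearity between the two regressors.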



The "distance" statistic for each of the 249 days in our
synthetic data set has been computed following Eqs. 8.5 and 8.6,
and the data sorted by this statistic. The top 20 data points
(with smallest distance) are shown in Table 8.8, as are the
regressor values, the measured and predicted values, and their
residuals. The last column lists the "distance" variable. Note
that this statistic varies from 1.78 to 23.06. In case the 90%
confidence intervals are to be determined, a distribution-free
approach is to use the corresponding values of the 5th and the
95th percentiles of the residuals. Since there are 20 points,
discarding the most extreme residual at each end of those shown in
Table 8.8 yields the 90% limits (-8.43 and 8.29, marked in the
table) around the model predicted value of 233.88 MBtu/day for the
cooling energy use. In this case, the distribution is fairly
symmetric, and one could report a local prediction value of
(233.88 ± 8.3 MBtu/day) at the 90% confidence level. If the
traditional method of reporting uncertainty were to be adopted,
the RMSE for the 2-10-1 ANN model, found to be 5.7414 (or a
CV = 6.9%), would result in (±9.44 MBtu/day) at the 90% confidence
level. Thus, using the k-nearest neighbors approach has led to
some reduction in the uncertainty interval around the local
prediction value; but more importantly, this estimate of
uncertainty is more realistic and robust since it better
represents the local behavior of the relationship between energy
use and the regressor variables. Needless to say, the advantage of
this entire method is that even when the residuals are not
normally distributed, the data itself can be used to ascertain
statistical limits.
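
The distribution-free limits can be reproduced from the 20 residuals listed in Table 8.8. The sketch below assumes a simple order-statistic rule (trim an equal share of points from each tail); the function name is hypothetical, and other percentile conventions would interpolate slightly differently.

```python
# Residuals of the 20 nearest neighbor days, in the order listed in Table 8.8.
residuals = [8.29, -4.97, 7.98, 0.95, -0.46, 0.64, -6.93, 6.50, 2.61, -2.29,
             6.33, 4.10, 3.00, -6.39, 3.70, 8.66, -8.43, -9.12, -0.92, 5.32]

def distribution_free_limits(values, confidence=0.90):
    """Order-statistic limits: trim an equal share of points from each tail."""
    ordered = sorted(values)
    n_trim = int(round(len(ordered) * (1.0 - confidence) / 2.0))
    return ordered[n_trim], ordered[-1 - n_trim]

lower, upper = distribution_free_limits(residuals)
print(lower, upper)
```

With 20 points and a 90% level, one point is trimmed from each tail, so the limits are the second smallest and second largest residuals.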



8.4 Classification and Regression Trees 
(CART) and Treed Regression 

The classification and regression trees (CART) approach is a
non-parametric decision tree technique that can be applied either
to classification or regression problems, depending on whether the
dependent variable is categorical or numeric, respectively. Recall
that nonparametric methods are those which do not rely on
assumptions about the data distribution. In Sect. 8.3.2 dealing
with decision trees, the model-building step was not needed
because the tree structure, the attributes and their decision
rules were specified explicitly. This will not be the case in most
classification problems. Constructing a tree is analogous to
training in a model-building context, but here, it involves
deciding on the following choices or collection of rules
(Dunham 2006):

(i) choosing the splitting attributes, i.e., the set of important
variables to perform the splitting; in many engineering
problems, this is a moot step,
(ii) ordering the splitting attributes, i.e., ranking them by
order of importance in terms of being able to explain the
variation in the dependent variable,
(iii) deciding on the number of splits of the splitting
attributes, which is dictated by the domain or range of
variation of that particular attribute,
(iv) defining the tree structure, i.e., number of nodes and
branches,
(v) selecting stopping criteria, which are a set of pre-defined
rules meant to reveal that no further gain is being made in
the model; this involves a trade-off between accuracy of
classification and performance, and
(vi) pruning a tree, which involves making modifications to the
tree constructed using the training data so that it applies
well to the testing data.

Table 8.8 Table showing how the model residual values can be used
to ascertain pre-specified confidence levels of the response adopting
a non-parametric approach (Example 8.3.3). Values shown of the
regressor and response variables are for the 20 closest neighborhood
points from a data set of 249 points. Reference point of DBT=75°F
and DPT-55°=5°F determined using ANN model 2-10-1 are shown, as
are the "distance" and the model residual values. The residual values
marked with * are the 5% and 95% values

No.   DBT     ResDPT   Q_c_Meas     Q_c_Model    Residuals    Distance
      (°F)    (°F)     (MBtu/day)   (MBtu/day)   (MBtu/day)
 1    74.67   4.76     225.59       233.88         8.29 *       1.78
 2    75.42   4.22     236.36       231.39        -4.97         4.47
 3    77.08   5.45     240.21       248.19         7.98         7.85
 4    76.54   6.23     251.99       252.94         0.95         8.62
 5    77.00   4.08     239.01       238.55        -0.46         8.71
 6    77.46   4.32     241.97       242.60         0.64         9.54
 7    72.83   3.88     224.54       217.61        -6.93         9.82
 8    77.75   5.00     240.63       247.13         6.50         9.86
 9    75.04   2.88     221.23       223.84         2.61        11.40
10    75.29   2.85     224.36       222.07        -2.29        11.61
11    74.71   2.80     214.56       220.89         6.33        11.89
12    78.08   6.08     252.04       256.14         4.10        12.49
13    77.33   3.07     231.57       234.58         3.00        13.32
14    71.42   3.37     210.83       204.44        -6.39        15.53
15    78.83   3.61     238.85       242.55         3.70        15.63
16    73.67   2.19     200.35       209.02         8.66        15.87
17    79.42   3.60     250.10       241.68        -8.43 *      17.54
18    73.00   1.62     213.35       204.23        -9.12        19.55
19    79.96   2.74     240.66       239.73        -0.92        21.53
20    78.42   1.37     224.75       230.06         5.32        23.06



Classification and regression trees (CART) is one of an
increasing number of computer intensive methods which perform an
exhaustive search to determine best tree size and configuration in
multivariate data. While being a fully automatic method, it is
flexible, powerful and parsimonious, i.e., it identifies a tree
with the fewest number of branches. Another appeal of CART is that
it chooses the splitting variables and splitting points that best
discriminate between the outcome classes. The algorithm, however,
suffers from the danger of overfitting, and hence, a
cross-validation data set is essential. This would assure that the
best tree configuration is selected which minimizes
misclassification rate, and also give realistic estimates of the
misclassification rate of the final tree. Most trees, including
CART, are binary decision trees (i.e., the tree splits into two
branches at each node), though they do not necessarily have to be
so. Also, each branch of the tree ends in a terminal node while
each observation falls into one and only one terminal node. The
tree is created by an exhaustive search performed at each node to
determine the best split. The computation stops when any further
split does not improve the classification. Treed regression is
very similar to CART except that the latter fits the mean of the
dependent variable in each terminal node, while treed regression
can assume any functional form.
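
The exhaustive search performed at a single node can be sketched as below. This is a simplified illustration and not the CART algorithm as implemented in any package: it assumes numeric regressors, binary splits at midpoints between consecutive observed values, and the number of misclassified observations as the criterion (production implementations commonly use impurity measures such as the Gini index). The function names and the small data set are hypothetical.

```python
from collections import Counter

def misclassified(labels):
    """Number of observations not in the majority class of a node."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def best_split(rows, labels):
    """Exhaustive search over every variable and every candidate threshold.

    Returns (errors, variable_index, threshold) for the binary split
    `x[variable_index] <= threshold` that misclassifies the fewest
    observations; (errors, None, None) if no split improves on the node.
    """
    best = (misclassified(labels), None, None)
    for var in range(len(rows[0])):
        values = sorted({row[var] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [lab for row, lab in zip(rows, labels) if row[var] <= thr]
            right = [lab for row, lab in zip(rows, labels) if row[var] > thr]
            errors = misclassified(left) + misclassified(right)
            if errors < best[0]:
                best = (errors, var, thr)
    return best

rows = [(2.0, 9.0), (3.0, 1.0), (4.0, 8.0), (6.0, 2.0), (7.0, 7.0), (8.0, 3.0)]
labels = ["low", "low", "low", "high", "high", "high"]
result = best_split(rows, labels)
print(result)
```

Growing a full tree would apply `best_split` recursively to the two resulting subsets until a stopping criterion is met, followed by pruning against a validation set.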

CART and treed regression are robust methods which are
ideally suited for the analysis of complex data which can be
numeric or categorical, involving nonlinear relationships,
high-order interactions, and missing values in either response
or regressor variables. Despite such difficulties, the methods
are simple to understand and give easily interpretable
results. Trees explain variation of a single response variab- 
le by repeatedly splitting the data into more homogeneous 
groups or spatial ranges, using combinations of explanatory 
variables that may be categorical and/or numeric. Each group 
is characterized by a typical value of the response variable, 
the number of observations in the group, and the values of 
the explanatory variables that define it. The tree is represen- 
ted graphically, and this aids exploration and understanding. 
Classification and regression have a wide range of applica- 
tions, including scientific experiments, medical diagnosis, 
fraud detection, credit approval, and target marketing (Hand 
1981). The book by Breiman et al. (1984) is recommended 
for those interested in a more detailed understanding of 
CART and its computational algorithms. Even though the 
following example illustrates the use of treed regression in 
a regression context with continuous variables, an identical 
approach would be used with categorical data. 

Example 8.4.1: Using treed regression to model atmosphe- 
ric ozone variation with climatic variables. 

Cleveland (1994) presents data from 111 days in the New York
City metropolitan region in the early 1970s consisting of the
ozone concentration (an index for air pollutant) in parts per
billion (ppb) and three climatic variables: ambient temperature
(in °F), wind speed (in mph) and solar radiation (in langleys). It
is the intent to develop a regression model for predicting ozone
levels against the three variables. The pairwise scatter plots of
ozone (the dependent variable) and the other three variables are
shown in Fig. 8.10a. One notes that though some sort of
correlation exists, the scatter is fairly large. An obvious
approach is to use multiple regression with inclusion of higher
order terms as necessary.

Fig. 8.10 a Scatter plots of ozone versus climatic data, b Scatter
plots and linear regression models for the three terminal nodes of
the treed regression model, c Treed regression model for
predicting ozone level against climatic variables. (From Kotz 1997
by permission of John Wiley and Sons)
[Panels a: ozone (ppb) versus solar radiation (langleys),
temperature (°F) and wind speed (mph). Panels b: wind speed
< 6 mph; wind speed > 6 mph, temperature < 82.5°F; wind speed
> 6 mph, temperature > 82.5°F. Panel c node models:
9.0706 + 0.41211 radiation; -38.953 + 0.85932 temperature;
113.84 - 4.8525 wind.]

An alternative, and in many cases superior, approach is to use
treed regression. This involves partitioning the spatial region
into sub-regions, and identifying models for each region or
terminal node separately. Kotz (1997) used a treed regression
approach to identify three terminal nodes as shown in Fig. 8.10c:
(i) wind speed < 6 mph (representative of stagnant air
conditions), (ii) wind speed > 6 mph and ambient temperature
< 82.5°F, and (iii) wind speed > 6 mph and ambient temperature
> 82.5°F. The corresponding pairwise scatter plots are shown in
Fig. 8.10b while the individual models are shown in Fig. 8.10c.
One notes that though there is some scatter around a straight line
for the three terminal nodes, it is much less than that resulting
from a straightforward multiple regression approach. ■
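
The treed regression model of this example amounts to a piecewise-linear predictor, sketched below. The coefficients are those legible in Fig. 8.10c; their assignment to the three terminal nodes is inferred from the figure layout, and the handling of values exactly at 6 mph or 82.5°F is an assumption, since the text only gives strict inequalities. The function name is hypothetical.

```python
def predict_ozone(radiation, temperature, wind):
    """Treed (piecewise-linear) prediction of ozone in ppb.

    radiation in langleys, temperature in deg F, wind speed in mph.
    """
    if wind < 6.0:                        # node (i): stagnant air conditions
        return 9.0706 + 0.41211 * radiation
    if temperature < 82.5:                # node (ii): windy and cooler
        return -38.953 + 0.85932 * temperature
    return 113.84 - 4.8525 * wind         # node (iii): windy and hot

print(predict_ozone(200.0, 75.0, 4.0))
print(predict_ozone(200.0, 70.0, 10.0))
print(predict_ozone(200.0, 90.0, 10.0))
```

Each terminal node uses a different single regressor, which a single global multiple regression model could not express without interaction terms.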



8.5 Clustering Methods 

8.5.1 Types of Clustering Methods 

The aim of cluster analysis is to allocate a set of observation 
sets into groups which are similar or "close" to one another 
with respect to certain attribute(s) or characteristic(s). Thus, 
an observation can be placed in one and only one cluster. 
For example, performance data collected from mechanical 
equipment could be classified as representing good, faulty or 
uncertain operation. 

In general, the number of clusters is not predefined and has to
be gleaned from the data set. This, and the fact that one does not
have a training data set to build a model, make clustering a much
more difficult problem than classification. A wide variety of
clustering techniques and algorithms has been proposed, and there
is no generally accepted best method. Some authors (for example,
Chatfield 1995) point out that, except when the clusters are
clear-cut, the resulting clusters often depend on the analysis
approach used and are somewhat subjective. Thus, there is often no
single best result, and there exists the distinct possibility that
different analysts will arrive at different results.

Broadly speaking, there are two types of clustering methods,
both of which are based on distance algorithms where objects are
clustered into groups depending on their relative closeness to
each other. Such distance measures have been described earlier:
the Euclidean distance given by Eq. 8.2 or the Mahalanobis
distance given by Eq. 8.3. One clustering approach involves
partitional clustering where non-overlapping clusters are
identified. The second involves hierarchic clustering which allows
one to identify closeness of different objects at different levels
of aggregation. Thus, one starts by identifying several
lower-level clusters or groups, and then gradually merges these in
a sequential manner depending on their relative closeness, so that
finally only one group results. Both approaches rely, in essence,
on identifying clusters which exhibit small within-cluster
variation as against large between-cluster variation. Several
algorithms are available for cluster analysis, and the intent here
is to provide a conceptual understanding.



8.5.2 Partitional Clustering Methods 

Partitional clustering (or disjoint clusters) determines the
optimal number of clusters by performing the analysis with
different pre-selected numbers of clusters. For example, if a
visual inspection of the data (which is impossible in more than
three dimensions) suggests, say, 2, 3, or 4 clusters, the analysis
is performed separately for all three cases. The analysis would
require specifying a criterion function used to assess the
goodness-of-fit. A widely used criterion is the within-cluster
variation, i.e., a squared error metric which measures the squared
distance from each point within the cluster to the centroid of the
cluster (see Fig. 8.11). Similarly, a between-cluster variation
can be computed representative of the distance from one cluster
center to another. The ratio of the between-cluster variation to
the average within-cluster variation is analogous to the F-ratio
used in ANOVA tests. Thus, one starts with an arbitrary number of
cluster centers, assigns objects to what is deemed to be the
nearest cluster center, computes the F-ratio of the resulting
clusters, and then jiggles the objects back and forth between the
clusters, each time re-calculating the mean, so that the F-ratio
is maximized or is sufficiently large. It is recommended that this
process be repeated with different seeds or initial centers since
their initial selection may result in cluster formations which are
localized. This tedious process can only be done by computers for
most practical problems. A slight variant of the above algorithm
is the widely used k-means algorithm where, instead of an F-test,
the sum of the squared errors is directly used for clustering.
This is best illustrated with a simple two-dimensional example.

Fig. 8.11 Schematic of two clusters with individual points shown as x.
The within-cluster variation is the sum of the individual distances from
the centroid to the points within the cluster, while the between-cluster
variation is the distance between the two centroids

Example 8.5.1:^ Simple example of the k-means clustering 
algorithm. 

Consider five objects or points characterized by two Cartesian
coordinates: $x_1 = (0,2)$, $x_2 = (0,0)$, $x_3 = (1.5,0)$,
$x_4 = (5,0)$, and $x_5 = (5,2)$. The process of clustering these
five objects is described below.

(a) Select an initial partition of k clusters containing randomly
chosen samples and compute their centroids
Say, one selects two clusters and assigns $C_1 = (x_1, x_2, x_4)$
and $C_2 = (x_3, x_5)$. Next, the centroids of the two clusters
are determined:

$M_1 = \{(0 + 0 + 5)/3, (2 + 0 + 0)/3\} = \{1.66, 0.66\}$
$M_2 = \{(1.5 + 5)/2, (0 + 2)/2\} = \{3.25, 1.0\}$

(b) Compute the within-cluster variations:

$e_1^2 = [(0 - 1.66)^2 + (2 - 0.66)^2] + [(0 - 1.66)^2 + (0 - 0.66)^2] + [(5 - 1.66)^2 + (0 - 0.66)^2] = 19.36$
$e_2^2 = [(1.5 - 3.25)^2 + (0 - 1)^2] + [(5 - 3.25)^2 + (2 - 1)^2] = 8.12$

and the total error $E^2 = e_1^2 + e_2^2 = 19.36 + 8.12 = 27.48$

(c) Generate a new partition by assigning each sample to the
closest cluster center
For example, the distance of $x_1$ from the centroid $M_1$ is
$d(M_1, x_1) = (1.66^2 + 1.34^2)^{1/2} = 2.14$, while that from
$M_2$ is $d(M_2, x_1) = 3.40$. Thus, object $x_1$ will be assigned
to the group which has the smaller distance, namely $C_1$.
Similarly, one can compute distance measures of all other objects,
and assign each object as shown in Table 8.9.

(d) Compute new cluster centers as centroids of the clusters
The new cluster centers are $M_1 = \{0.5, 0.67\}$ and
$M_2 = \{5.0, 1.0\}$

(e) Repeat steps (b) and (c) until an optimum value is found
or until the cluster membership stabilizes



From Kantardzic (2003) by permission of John Wiley and Sons. 

Table 8.9 Distance measures of the five objects with respect to the
two groups

d(M1, x1) = 2.14    d(M2, x1) = 3.40    So assign x1 ∈ C1
d(M1, x2) = 1.79    d(M2, x2) = 3.40    So assign x2 ∈ C1
d(M1, x3) = 0.83    d(M2, x3) = 2.01    So assign x3 ∈ C1
d(M1, x4) = 3.41    d(M2, x4) = 2.01    So assign x4 ∈ C2
d(M1, x5) = 3.60    d(M2, x5) = 2.01    So assign x5 ∈ C2


For the new clusters $C_1 = (x_1, x_2, x_3)$ and
$C_2 = (x_4, x_5)$, the within-cluster variations and the total
squared error are: $e_1^2 = 4.17$, $e_2^2 = 2.00$, $E^2 = 6.17$.
Thus, the total error has decreased significantly after just one
iteration. ■
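
The iteration of this example can be reproduced with a short script. The five points and the initial partition are those of the example; the function names are hypothetical, and the sketch does not handle the case of a cluster becoming empty. Because the script keeps full precision for the centroids rather than the rounded values 1.66 and 0.66, its starting error is about 27.46 instead of the 27.48 quoted above.

```python
import math

points = [(0.0, 2.0), (0.0, 0.0), (1.5, 0.0), (5.0, 0.0), (5.0, 2.0)]  # x1..x5

def centroid(cluster):
    """Coordinate-wise mean of the points in a (non-empty) cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def total_squared_error(points, labels, k):
    """Sum over clusters of squared distances from members to their centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        center = centroid(members)
        total += sum(math.dist(p, center) ** 2 for p in members)
    return total

def kmeans_step(points, labels, k):
    """One iteration: compute centroids, then reassign points to the nearest."""
    centers = [centroid([p for p, lab in zip(points, labels) if lab == c])
               for c in range(k)]
    return [min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points]

labels = [0, 0, 1, 0, 1]        # initial partition: C1=(x1,x2,x4), C2=(x3,x5)
e_start = total_squared_error(points, labels, 2)
while True:
    new_labels = kmeans_step(points, labels, 2)
    if new_labels == labels:    # cluster membership has stabilized
        break
    labels = new_labels
e_final = total_squared_error(points, labels, 2)
print(labels, round(e_start, 2), round(e_final, 2))
```

For this data set the membership stabilizes after a single reassignment, so the second pass of the loop only confirms convergence.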

It is recommended that the data be plotted so that starting
values of the cluster centers could be visually determined. Though
this is a good strategy in general, there are instances when this
is not optimal. Consider the data set in Fig. 8.12a, where one
would intuitively draw the two clusters (dotted circles) as shown.
However, it turns out that the split depicted in Fig. 8.12b
results in a lower sum of squared error, which is the better
manner of performing the clustering. Thus, an initial definition
of cluster centers done visually has to be verified by analytical
measures. Though the k-means clustering method is very popular, it
is said to be sensitive to noise and outlier points.

Example 8.5.2:'' Clustering residences based on their air- 
conditioner electricity use during peak summer days 

Electricity used by residential air conditioners in the U.S. 
has been identified as being largely responsible for the high 
electric demand faced by electric utilities during hot sum- 
mer afternoons. Being able to classify residences based on 
their diurnal profiles during such critical days would be 
advantageous to electric utilities. For example, they would 
be able to better design and implement cost-effective peak 
shaving strategies (such as direct load control, cool storage, 
offering financial incentives, ...). Hourly data for 73 resi- 
dential homes was collected during an entire summer. Data 
corresponding to six of the peak days were extracted with 
the intent to classify residences based on their similarity in 
diurnal profiles during these days. Clustering would require 
two distinct phases: first, a process whereby the variability 
of the patterns can be quantified in terms of relatively few 
statistical parameters, and second, a process whereby objects 
are assigned to specific groups for which both the nuclei and 
the boundaries need to be determined. 

First, the six diurnal profiles for each of the customers 
were averaged so as to obtain a single diurnal profile. The 
peak period occurs only for a certain portion of the day, in 
this case, from 2:00-8:00 pm, and hence the diurnal profi- 
le during this period is of greater importance than that out- 
side this period. The most logical manner of quantifying the 
hourly variation during this period is to compute a measure 
representative of the mean and one of the standard deviation. 
Hence, a "peak size" was defined as the fraction of the air- 
conditioner use during the peak period divided by the total 
daily usage, and a "peak variation" as the standard deviation 
of the hourly air-conditioning values during the same peak 
period. 
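
The two clustering attributes defined above can be computed from an averaged 24-hour profile as sketched below. Several details are assumptions not stated in the text: hourly values are indexed 0 to 23 by the hour in which they start, the 2:00-8:00 pm peak period is taken as indices 14 to 19, and the population form of the standard deviation is used. The function name and the sample profiles are hypothetical.

```python
import statistics

PEAK_HOURS = range(14, 20)   # assumed indexing: hours starting 2 pm to 7 pm

def peak_features(profile):
    """Return (peak size, peak variation) for a 24-value diurnal profile."""
    peak = [profile[h] for h in PEAK_HOURS]
    peak_size = sum(peak) / sum(profile)        # fraction of daily use in peak
    peak_variation = statistics.pstdev(peak)    # spread of hourly peak values
    return peak_size, peak_variation

flat = [1.0] * 24
peaky = [1.0] * 14 + [2.0, 3.0, 4.0, 4.0, 3.0, 2.0] + [1.0] * 4

print(peak_features(flat))
print(peak_features(peaky))
```

Each residence then reduces to one point in this two-dimensional space, which is the input to the clustering step.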

The two-dimensional data is plotted in Fig. 8.13a, while a
partitional disjoint clustering approach was used to identify the
nine clusters (Fig. 8.13b). While some points are outliers, one
does detect a logical and well-behaved clustering of the rest of
the objects. Each of the individual clusters can be interpreted
based on its peak pattern, only three of which are shown in
Fig. 8.13c. It can be noted that Cluster 1 includes individuals
with a flat usage pattern, while Cluster 4 comprises customers
with properly sized and well-operated air-conditioners, and
Cluster 7 those with a much higher peak load probably because of
varying their thermostat setting during the day. ■



Fig. 8.12 Visual clustering may not always lead to an optimal splitting. 
a Obvious manner of clustering, b The better way of clustering which 
results in lower sum of squared error. (From Duda et al. 2001 by per- 
mission of John Wiley and Sons) 



' From Hull and Reddy (1990). 
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Fig. 8.13 Clustering of the 73 
residences based on their air-conditioner use during peak hours. a Normalized two-dimensional data, b Clustered data, c Normalized profiles of three of the eight clusters of homeowners identified. (From Hull and Reddy 1990)



[Fig. 8.13c panels: Cluster 4 and Cluster 7; horizontal axis: Hour of day]



8.5.3 Hierarchical Clustering Methods 

Another cluster identification algorithm, called hierarchical 
clustering, does not start by partitioning a set of objects into 
mutually exclusive clusters, but forms them sequentially in a 



nested fashion. For example, the eight objects shown at the 
left of the tree diagram (also called dendrogram) in Fig. 8.14a 
are merged into clusters at different stages depending on their 
relative similarity. This allows one to identify objects which 
are close to each other at different levels. The sets of objects 
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Fig. 8.14 Example of hierarchi- 
cal agglomerative clustering, a 
Tree diagram or dendrogram, b 
Different levels of clustering the 
tree diagram 




[Fig. 8.14b panels: (a) Eight clusters, (b) Five clusters, (c) Four clusters, (d) Two clusters, (e) One cluster]



(O,, O^), (O^, O5), and (O^, O^), are the most similar to each 
other, and are merged together resulting in the five cluster 
diagram in Fig. 8.14b. If one wishes to form four clusters, it 
is best to merge object O^ with the first set (Fig. 8.14b). This 
merging is continued until all objects have been combined into a single undifferentiated group. Though the last step has little value, it is the sub-levels which provide insights into the extent to which different objects are close to one another. This process of starting with individual objects and repeatedly merging the nearest objects into clusters until one is left with a single cluster is referred to as agglomerative clustering.
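The agglomerative procedure can be sketched as follows. This is a minimal illustration, not the algorithm of any particular package: the function names are hypothetical, the five objects are invented (they are not those of Fig. 8.14), and single linkage (nearest-member distance) is assumed as the measure of closeness between clusters, a choice the text leaves open. The sketch merges one pair per step, so each level has exactly one cluster fewer than the previous one.

```python
from itertools import combinations
from math import dist


def single_linkage(cluster_a, cluster_b, points):
    """Distance between two clusters: smallest pairwise point distance."""
    return min(dist(points[i], points[j]) for i in cluster_a for j in cluster_b)


def agglomerate(points):
    """Repeatedly merge the two nearest clusters; return every level."""
    clusters = [(i,) for i in range(len(points))]  # each object on its own
    levels = [list(clusters)]
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        levels.append(list(clusters))
    return levels


# Hypothetical two-dimensional objects, indexed 0..4.
objects = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
levels = agglomerate(objects)
for level in levels:
    print(len(level), "clusters:", sorted(level))
```

Reading the printed levels from top to bottom corresponds to moving from the leaves of the dendrogram towards its root.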

Another approach, called divisive clustering, tackles the problem in the other direction: it starts by placing all objects in a single cluster and repeatedly splits the clusters in two until all objects are placed in their own cluster. These two somewhat complementary approaches are akin to the forward and backward stepwise regression approaches. Note that the two approaches do not always cluster a given set of data in the same way.

Hierarchical techniques are appropriate for instances when the data set has naturally-occurring or physically-based nested relationships, such as plant or animal taxonomies (Dunham 2006). The partitional clustering algorithm is advantageous in applications involving large data sets for which hierarchical clustering is computationally complex. A very basic overview of clustering methods has been given in this chapter; the interested reader can refer to pertinent texts such as Dunham (2006), Manly (2005), Hand (1981) or Duda et al. (2001) for in-depth mathematical treatment, description of the clustering algorithms and various applications.



Problems 

Pr. 8.1 Rework Example 8.2.4 for a new anthracite coal sample found to contain 60-70% carbon.

Pr. 8.2 Consider the two-dimensional data of three groups shown in Table 8.10. Using the standardized Euclidean distance:

(i) Identify boundaries for the three groups so as to make the misclassification rates more or less equal among all three groups. State the misclassification rates.
(ii) Classify the following four points into one of the three groups: (35,12), (18,20), (28,16) and (12,35).

Pr. 8.3 The intent is to cluster the cities listed in Table 8.11 into two or three groups.

(a) Perform hierarchical clustering and generate the dendrogram. Identify the different levels.



Table 8.10 Data table for Problem 8.2

Group A        Group B        Group C
x1     x2      x1     x2      x1     x2
39     14      25     14      13     17
47      8      22     16      22     26
42     10      23     17      19     23
32     12      22     16      11     15
43     13      31     15      12     19
35     12      27     14      20     15
41     12      24     19      16     24
44      8      31     12      18     23

Table 8.11 Data table for Problem 8.3

City                    Average horizontal      Average annual
                        annual radiation        ambient temperature
                        (MJ/m²-day)             (°C)
Miami, USA              16.764                  23.7
New York, USA           12.515                  12.6
Phoenix, USA            21.281                  20.0
Kabul, Afghanistan      17.439                  12.0
Melbourne, Australia    15.203                  14.8
Beijing, China          14.598                  12.0
Cairo, Egypt            19.979                  22.0
New Delhi, India        19.698                  25.3
Sede Boqer, Israel      19.880                  18.3
Bangkok, Thailand       17.053                  31.8
London, UK               9.127                  10.5



(b) Perform a partitional clustering along the lines shown in Example 8.5.1.

(c) Compare the two approaches in terms of ease of use, interpretation and simplicity.

Pr. 8.4 The Human Development Index (HDI) is a composite statistic used to rank countries by level of "human development", which includes life expectancy, education level and per-capita gross national product (an indicator of standard of living). Table 8.12 assembles values of HDI and energy use per capita for 30 countries classified as Groups A, B and C.

(a) Develop classification models for the three groups (plot the data to detect any trends). You can evaluate the statistical classification method or discriminant analysis, as appropriate. Report misclassification rates, if any.

(b) Classify the three countries shown at the end (France, Israel and Greece).

Table 8.12 Data table for Problem 8.4

Group             Country          HDI (2010)    Energy (W)/capita (2003)
A                 Norway           0.938          7,902
                  Australia        0.937          7,622
                  New Zealand      0.907          5,831
                  United States    0.902         10,381
                  Ireland          0.895          5,009
                  Netherlands      0.89           6,675
                  Canada           0.888         11,055
                  Sweden           0.885          7,677
                  Germany          0.885          5,598
                  Japan            0.884          5,381
B                 Chile            0.783          2,200
                  Argentina        0.775          2,097
                  Libya            0.755          4,266
                  Saudi Arabia     0.752          7,434
                  Mexico           0.75           2,041
                  Russia           0.719          5,890
                  Iran             0.702          2,709
                  Brazil           0.699          1,422
                  Venezuela        0.696          2,739
                  Algeria          0.677          1,382
C                 Indonesia        0.6            1,009
                  South Africa     0.597          3,459
                  Syria            0.589          1,307
                  Vietnam          0.572            718.2
                  Morocco          0.567            476
                  India            0.519            682.4
                  Pakistan         0.49             608.2
                  Congo            0.489            363.1
                  Kenya            0.469            640.9
                  Bangladesh       0.469            214.4
To be classified  France           0.872          6,018
                  Israel           0.795          3,156
                  Greece           0.855          3,594
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Analysis of Time Series Data 
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This chapter introduces several methods to analyze time 
series data in the time domain, an area rich in theoretical development and in practical applications. What constitutes time series data and some of the common trends encountered are first presented. This is followed by a description of three types of time-domain modeling and forecasting models. The first general class of models involves moving average smoothing techniques, which are methods for removing rapid fluctuations in time series so that the general secular trend can be seen. The second class of models is similar to classical OLS regression models, but here the time variable appears as a regressor, thereby allowing the trend and the seasonal behavior in the data series to be captured by the model. Its strength lies in its ability to model the deterministic or structural trend of the data in a relatively simple manner. The third class of models, called ARIMA models, allows separating and modeling the systematic component of the model residuals from the purely random white noise element, thereby enhancing the prediction accuracy of the overall model. ARMAX models, which are extensions of the univariate ARIMA models to multivariate problems, and their ability to model dynamic systems are also discussed with illustrative examples. Finally, an overview is provided of a practical application involving control chart techniques, which are extensively used for process and condition monitoring of engineered systems.



9.1 Basic Concepts 

9.1.1 Introduction 

Time series data is not merely data collected over time. If this definition were true, then almost any data set would qualify as time series data. There must be some sort of ordering, i.e., a relation between successive data observations. In other words, successive observations in time series data are usually not independent, and their order or sequence needs to be maintained during the analysis. A collection of numerical observations arranged in a natural order, with each observation associated with a particular instant or interval of time which provides the ordering, would qualify as time series data (Bloomfield 1976). One example is temperature measurements of an iron casting as it cools over time. The hourly variation of electricity use in a commercial building during the day and over a year would also qualify as time series data. Thus, the inherent behavior or the response of the system is affected by the time variable (either directly, such as the cooling of a billet, or indirectly, such as the electricity use in a building). Note that "time" need not necessarily mean time in the physical sense, but any variable to which an ordering can be associated. Another more practical way of ascertaining whether the data is to be treated as time series data or not is to determine if the analysis results would change if the sequence of the data observations were to be scrambled. The importance of time series analysis is that it provides more insight into, and more accurate modeling and prediction of, time series data than does classical statistical analysis because of the explicit manner in which the systematic residual behavior of the data is accounted for in the model.
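The scrambling check mentioned above can be tried numerically. The sketch below is only an illustration under stated assumptions: the series is synthetic (a steady drift plus an alternating wiggle), the helper name is hypothetical, and the lag-1 sample autocorrelation is chosen here as one convenient order-dependent statistic; the text does not prescribe a particular one.

```python
import random
from statistics import mean


def lag1_autocorrelation(series):
    """Sample autocorrelation between successive observations."""
    m = mean(series)
    numerator = sum((a - m) * (b - m) for a, b in zip(series, series[1:]))
    denominator = sum((a - m) ** 2 for a in series)
    return numerator / denominator


# Hypothetical ordered data: upward drift plus an alternating wiggle.
series = [0.5 * t + (1.0 if t % 2 else -1.0) for t in range(40)]

# Scramble a copy of the same values with a fixed seed for repeatability.
scrambled = series[:]
random.Random(0).shuffle(scrambled)

r_ordered = lag1_autocorrelation(series)
r_scrambled = lag1_autocorrelation(scrambled)
print(f"ordered: {r_ordered:.2f}, scrambled: {r_scrambled:.2f}")
```

The mean and variance of the two lists are identical, yet the order-dependent statistic changes markedly once the sequence is scrambled, which is the sign that the data should be treated as a time series.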

Consider Fig. 9.1 where the hourly loads during a week of an electric utility are shown. The loads are highly influenced by those of residential and commercial buildings and industrial facilities within the utility's service territory. Because of distinct occupied and unoccupied schedules, there are both strong diurnal and weekly variations in the load. These loads are also affected by such variables as outdoor temperature and humidity. Hence, developing models which can forecast loads, say, a day ahead will allow electric utilities to better plan their operations. Figure 9.2 is another representation where the cyclic pattern (shown as dots) indicates the electric demand of an electric utility during each quarter of a 3-year period. Using traditional ordinary least squares (OLS) covered in Chap. 5, one would obtain the mean of future predictions (or forecasts) and the upper and lower confidence limits (CL) as shown. Analyzing data points in a time series framework would both improve the forecasts and narrow the confidence interval. Time series methods have been applied to numerous applications; to name a few: data which exhibit periodicities, condition monitoring of industrial processes (using control chart techniques), and the systematic modeling of dynamic systems.

Fig. 9.1 Daily peak and minimum hourly loads over several months for a large electric utility to illustrate the diurnal, the weekday/weekend and the seasonal fluctuations and trends

Fig. 9.2 Forecasts and prediction intervals of the quarterly electric demand of an electric utility using ordinary least square fit to historic data

Similar to traditional data analysis, there are two kinds of time series analysis: (i) descriptive, which uses graphical and numerical techniques to provide the necessary understanding, and (ii) inferential, which allows future values to be forecast (or predicted) along with a measure of their confidence intervals. Both these aspects are complementary. Usually time series data (either from natural phenomena such as, say, the occurrence of sun spots, or from industrial process monitoring) need to be understood and modeled first, prior to forecasting and, if possible, control. The forecast is not an end in itself; it is part of a larger issue such as taking corrective action.



Time series analysis has several features in common with classical statistical analysis. If the physics of the system is not well understood and curve fitting is resorted to, the subjective element involved in trying to select an appropriate time series model for a given set of data is a major issue. Other practical problems include missing observations, outliers or interruptions in the series due to a momentary impulse acting on the system. The analysis of time series data is further complicated by the possible presence of trend and seasonal variation which can be hard to estimate and/or remove. Finally, inferences involving the non-stochastic¹ or deterministic trend are based on OLS where errors are assumed to be independent and uncorrelated (white noise), while in time series data analysis the errors are treated as being correlated. It is important to note that the OLS regression parameters identified from the time series data are unbiased per se, i.e., in the long run, an OLS regression model will yield the right average values of the parameters. However, the statistical significance of these parameters, i.e., the standard errors of the parameters, will be improper, often resulting in an underestimation of the confidence intervals.

¹ A stochastic process is one which is described by a set of time-indexed observations subject to probabilistic laws.

Time series can be analyzed in one of two ways:

(a) Time domain analysis, in which the behavior of a series is described in terms of the manner in which observations at different times are related statistically. This approach is usually more intuitive to beginners and is used almost exclusively in disciplines such as econometrics and social science whose models are less deterministic; and

(b) Frequency domain analysis, which seeks to describe the fluctuations in one or more series in terms of sinusoidal behavior at various frequencies. This approach has been used extensively in the physical sciences, especially engineering, physics, and astronomy. One can distinguish four sub-categories:

(b1) Fourier series analysis, in its narrow sense, is the decomposition or approximation of a function into a sum of sinusoidal components (Bloomfield 1976). In its wider sense, Fourier series analysis is a procedure that describes or measures the fluctuations in time series by comparing them with sinusoids when data exhibits clear periodic components.
(b2) Harmonic analysis extends the capability of Fourier series analysis by allowing detection of periodic components or hidden periodicities in cases when the data do not appear periodic.
(b3) Complex demodulation is a more flexible approach than harmonic analysis and is used to describe features in the data that would be missed by harmonic analysis, and also to verify, in some cases, that no such features exist. The price of this flexibility is a loss of precision in describing pure frequencies, for which harmonic analysis is more exact.
(b4) Spectral analysis describes the tendency for oscillations of a given frequency to appear in the data, rather than the oscillations themselves. It is a modification of Fourier analysis so as to make it suitable for stochastic rather than deterministic functions of time.

Frequency domain methods will not be treated in this book, and the interested reader can refer to several good texts such as Bloomfield (1976) and Chatfield (1989).



9.1.2 Terminology 

Terminology and notations used in time series analysis differ somewhat from classical statistical analysis (Montgomery and Johnson 1976).

(a) Types of data: A time series is continuous when observations are made continuously in time, even if the measured variable takes on only a discrete set of values. A time series is said to be discrete when observations are taken only at specific times, usually equally spaced, even if the variable is continuous in nature (such as, say, outdoor temperature). Other types of data (such as dividends paid by a company to its shareholders) are inherently discrete. Further, one distinguishes between two types of discrete data. Period or sampled data represent aggregate values of a parameter over a period of time, such as the average temperature during a day. Point or instantaneous data represent the value of a variable at specific time points, such as the temperature at noon. The difference between these two types of data has implications primarily for the type of data collection system to be used, and for the effect of measurement and data processing errors on the results.



(b) Types of forecast: Forecasting (or prediction) is the technique used to predict future values based upon past and present values (i.e., the estimation or model identification period) of the parameter in question. It is useful to distinguish between two types of forecast: point forecasts, where a single number is predicted in each forecast period, and interval forecasts, where an interval or range is deduced over which the realized value is expected to lie. The latter provides a means of ascertaining prediction intervals (see Fig. 9.2).

(c) Ex post and ex ante forecasts: In the ex post forecast, the forecast period is such that observations of both the driving variables and the response variable are known with certainty. Thus, ex post forecasts can be checked with existing data and provide a means of evaluating the model. An ex ante forecast predicts values of the response variable when those of the driving variables are: (i) known with certainty, referred to as conditional ex ante forecast, or (ii) not known with certainty, denoted as unconditional ex ante forecast. Thus, the unconditional forecast is more demanding than conditional forecasting since the driving variables also need to be predicted into the future (along with the associated uncertainty which this entails).

(d) Types of forecast time elements: One also needs to distinguish between the following three time periods. The forecasting period is the basic unit of time for which forecasts are made. For example, one may wish to forecast electricity use of a building on an hourly basis, i.e., the period is an hour. The forecasting horizon or lead time is the number of periods into the future covered by the forecast. If electricity use in a building over the next day is to be forecast, then the horizon is 24 h, broken down by hour. Finally, the forecasting interval is the frequency with which new forecasts are prepared. Often this is the same as the forecasting period, so that forecasts are revised each period using the most recent period's value and other current information as the basis for revision. If the horizon is always the same length and the forecast is revised each period, one is then operating on a moving horizon basis.
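The three time elements in (d) can be made concrete with a small bookkeeping sketch. It is only an illustration: the function and argument names are hypothetical, periods are simply numbered 0, 1, 2, ..., and no forecasting model is involved; the sketch merely lists which future periods each newly issued forecast covers.

```python
def forecast_schedule(n_periods, horizon, interval):
    """Map each forecast origin to the future periods its forecast covers.

    n_periods: number of elapsed periods to step through
    horizon:   number of periods ahead covered by each forecast
    interval:  number of periods between successive forecast revisions
    """
    schedule = {}
    for origin in range(0, n_periods, interval):
        schedule[origin] = list(range(origin + 1, origin + horizon + 1))
    return schedule


# Hourly period, 24 h horizon, forecast revised every hour (moving horizon).
hourly = forecast_schedule(n_periods=3, horizon=24, interval=1)
for origin, covered in hourly.items():
    print(f"issued at hour {origin}: covers hours {covered[0]} to {covered[-1]}")
```

With the interval equal to one period and a fixed horizon, each revision drops the hour that has just elapsed and adds one new hour at the far end, which is the moving horizon mode described in the text.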



9.1.3 Basic Behavior Patterns

Time series data can exhibit different types of behavior patterns. One can envision various patterns, four of which are:
(i) variations around a constant mean value, referred to as a constant process, such as Fig. 9.3a,
(ii) trend or secular behavior, i.e., a long-term change in the mean level which may be linear (such as Fig. 9.3b) or non-linear,
(iii) cyclic or periodic (or seasonal) behavior, such as Fig. 9.3c, and
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Fig. 9.3 Different characteristics of time series, a Constant process, 
b linear trend, c cyclic variation, d impulse, e step function, f ramp. 
(From Montgomery and Johnson 1976 by permission of McGraw-Hill) 



(iv) transient behavior, where one can have a momentary impulse, such as Fig. 9.3d, or a step change (as in Fig. 9.3e), or a ramp up, i.e., where the increase is more gradual (as in Fig. 9.3f).

Much of the challenge in time series analysis is distinguishing these basic behavior patterns when they occur in conjunction. Untangling the data into these patterns requires a certain amount of experience and skill (especially because the decomposition is often not unique), only after which can model identification and forecasting be done with confidence. The problem is compounded by the fact that processes may exhibit these patterns at different times. For example, the growth of bacteria in a pond may experience an exponential growth followed by a stable constant regime and, finally, a declining trend phase.
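To get a feel for how these patterns look individually and in conjunction, noise-free versions can be generated as below. This is a sketch only: the function name, the base level, the slopes, the 12-period cycle and the timing of the transient are all arbitrary assumptions and are not taken from Fig. 9.3.

```python
import math


def pattern(kind, n=24, level=10.0):
    """Noise-free series of length n for one of six basic shapes."""
    half = n // 2  # transient events are placed at mid-series (assumption)
    shapes = {
        "constant": lambda t: level,
        "linear trend": lambda t: level + 0.5 * t,
        "cyclic": lambda t: level + 2.0 * math.sin(2 * math.pi * t / 12),
        "impulse": lambda t: level + (5.0 if t == half else 0.0),
        "step": lambda t: level + (5.0 if t >= half else 0.0),
        "ramp": lambda t: level + 0.5 * max(0, t - half),
    }
    return [shapes[kind](t) for t in range(n)]


# Two patterns occurring in conjunction: a linear trend plus a cyclic swing.
trend = pattern("linear trend")
cycle = pattern("cyclic")
combined = [a + b - 10.0 for a, b in zip(trend, cycle)]  # keep a single level
print([round(v, 1) for v in combined[:6]])
```

Plotting `combined` next to its two constituents shows why the decomposition takes some care: the same curve could also be read as, say, a sequence of ramps and steps.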



Table 9.1 Demand data for an electric utility. (From Benson 1988 by permission of Pearson Education)
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Year 
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Year 


Quarter 
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1974 


1 


68.8 


1980 


1 


130.6 




2 


65 




2 


116.8 




3 


88.4 




3 


144.2 




4 


69 




4 


123.3 


1975 


1 


83.6 


1981 


1 


142.3 




2 


69.7 




2 


124 
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90.2 
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146.1 




4 


72.5 




4 


135.5 


1976 


1 


106.8 


1982 


1 


147.1 
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89.2 
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119.3 
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110.7 
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138.2 
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91.7 
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127.6 
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108.6 
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143.4 
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134 




3 
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135.1 
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94.2 
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123.3 
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120.5 
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4 


107.4 
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139.4 


1979 


1 


116.2 


1985 


1 


151.6 




2 


104.4 




2 


133.7 




3 


131.7 




3 


154.5 




4 


117.9 




4 


135.1 



9.1.4 Illustrative Data Set

Transient behavior in time series is indicative of a change in the basic dynamics of the phenomenon or process and has to be dealt with in a separate fashion. Removing the secular trend and the cyclic variation are the first steps in rendering the data stationary², only after which can a time-domain model be developed. In the frequency-domain approach, only the transient and trend patterns have to be removed from the basic time series data.

The following is a simple example of time series data exhibiting both a linear trend and seasonal periodicities. This data set will be used to illustrate many of the modeling approaches covered later in this chapter.

Example 9.1.1: Modeling peak demand of an electric utility
Table 9.1 assembles time series data of the peak demand or load for an electric utility for 12 years at quarterly levels (48 data points in total). The data is shown graphically in Fig. 9.4, revealing distinct long-term and seasonal trends. Most time series data sets may not exhibit such well-behaved patterns (and that is why this area of time series modeling is so rich and extensive); but this example is meant as a simple illustration. ■

² Stationarity in a time series strictly requires that all statistical descriptors, such as the mean, variance and correlation coefficients of the data series, be invariant in time. Due to the simplified treatment in this text, the discussion is geared primarily towards stabilizing the mean, i.e., removing the long-term and seasonal trends.
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Fig. 9.4 Time series data of electric power demand by quarter (data 
from Table 9.1) 



9.2 General Model Formulations 

How does one model the behavior of the data shown in Example 9.1.1 and use it for extrapolation purposes? There are three general time domain approaches:

(a) Smoothing methods, which are really meant to filter the data in a computationally simple manner. However, they can also be used for extrapolation purposes (covered in Sect. 9.3);

(b) OLS models, which treat time series data as cross-sectional data but with the time variable accounted for in an explicit manner as an independent variable (this is addressed in Sect. 9.4); and

(c) The stochastic time series modeling approach, which explicitly treats the model residual errors of (b) by adding a layer of sophistication; this is described briefly below, and at more length in Sect. 9.6.

A basic distinguishing trait is that while approach (b) uses the observations directly, stochastic time series modeling deals with stationary data series which have been made so either by removing the trend by OLS modeling or by temporal differencing, or by normalizing and stabilizing the variance by suitable transformations if necessary. Thus, the first step is to remove the deterministic trend and periodic components of the time series; this is referred to as making the series stationary. Let us illustrate the differences in approaches (b) and (c) in terms of an additive model. Heuristically:

• For approach (b): Current value at time t = [deterministic component] + [residual random error] = [constant + long-term trend + cyclic (or seasonal) trend] + [residual random error]

• For approach (c): Current value at time t = [deterministic component] + [stochastic component] = [constant + long-term trend + cyclic (or seasonal) trend] + [systematic stochastic component + white noise]



where

(a) the deterministic component includes the long-term (or secular) and seasonal trends, taken to be independent of the error structure in the observations, and is usually identified by, say, standard OLS regression. The parameters of the model which explain the deterministic behavior of the time series yield the "long run" or average or expected values of the underlying process along with the cyclic variations, though one should expect a certain deviation of any given observation from the expected value. It is the residual error (quantified, say, as the standard error or the RMSE) which determines the uncertainty of the predictions;

(b) the stochastic component treats the residual errors in a more refined sense by separating: (i) the systematic part in the errors responsible for the autocorrelation in time series data (if it exists). It is the determination of the structure of this systematic part which is the novelty in time series analysis and which allows a more accurate determination of the prediction uncertainty bands; and (ii) the white noise or purely random part of the stochastic component that cannot be captured in a model, and which appears implicitly in the determination of the prediction uncertainty bands. Thus, the stochastic methods exploit the dependency in successive observations to produce superior results in the determination of the prediction uncertainty. Note the distinction made between random errors and white noise in the residuals.
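The two heuristic statements can be written compactly as below. The symbols are notation introduced here purely for illustration and are not the author's: $b_0$ is the constant, $T_t$ the long-term trend, $S_t$ the cyclic (or seasonal) term, $\varepsilon_t$ the residual random error of approach (b), $u_t$ the systematic (autocorrelated) stochastic part and $a_t$ the white noise of approach (c).

```latex
% Approach (b): deterministic component plus a single residual error term
Y_t = \underbrace{b_0 + T_t + S_t}_{\text{deterministic}} + \varepsilon_t

% Approach (c): the residual is split into a systematic part and white noise
Y_t = \underbrace{b_0 + T_t + S_t}_{\text{deterministic}}
      + \underbrace{u_t + a_t}_{\text{stochastic}}
```

Comparing the two lines term by term gives $\varepsilon_t = u_t + a_t$ when both approaches share the same deterministic component: approach (c) does not change the deterministic part, it only models the structure inside the residual that approach (b) treats as a single random error.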

The deterministic models, called trend and seasonal models, are simple, easy to interpret, fairly robust and especially suitable for data with pronounced trends and/or large seasonal effects. Moreover, these models are usually simpler to use and can be applied to relatively short data series (less than, say, 50 observations). Section 9.3 describes two widely used smoothing approaches, moving average and exponential smoothing, both of which are methods for removing rapid fluctuations in time series so that the general trend can be seen. Subsequently, how to adapt the OLS regression approach to model trend behavior as well as seasonal behavior is discussed in Sect. 9.4. These two general classes of models involve the time series observations themselves. The third, namely the stochastic approach addressed in Sect. 9.5, builds on the latter. In other words, it is brought to bear only after the trend and seasonal behavior is removed from the data.



9.3 Smoothing Methods 

Time series data often exhibits local irregularities or rapid fluctuations resulting in trends that are hard to describe. Moving average modeling is a way to smooth out these fluctuations, thus making it easier to discern longer time trends and, thereby, allowing future or trend predictions to be made, albeit in a simple manner. However, though such models are useful in predicting mean future values, they do not provide any information about the uncertainty of these predictions since no modeling per se is involved, and so standard errors (needed to quantify forecast errors) cannot be estimated. The inability to quantify forecast errors is a serious deficiency.

Two types of models are often used: arithmetic moving averages and exponential moving averages, which are described below. Both of these models are recursive in that previous observations are used to predict future values. Further, they are linear in their parameters.

Fig. 9.5 Plots illustrating how two different AMA smoothing methods capture the electric utility load data denoted by MW(meas) (data from Table 9.1)



9.3.1 Arithmetic Moving Average (AMA)

Let $Y(t)$, $t = \{1, \ldots, N, \ldots, n\}$, be time series observations at $n$ discrete intervals of time. Instead of the functional notation, let us adhere to subscripts so that $Y(t) = Y_t$. AMA models of order N (where N < n), denoted by AMA(N), combine N past, current and future values of the time series in order to perform a simple arithmetic average which slides on a moving horizon basis. A time series which is a constant process, i.e., without trend or cyclic behavior, is the simplest case one can consider. In this case, one can assume a model such as:

$$Y_t = b_0 + \varepsilon_t \tag{9.1}$$

where $\varepsilon_t(0, \sigma^2)$ is a random variable with mean 0 and variance $\sigma^2$, and $b_0$ is an unknown parameter. To forecast future values of the time series, the unknown parameter $b_0$ is to be estimated. If all observations are equally important in estimating $b_0$, then the least-squares criterion involves determining the value which minimizes the sum of squares:

$$SS = \sum_{t=1}^{N} (Y_t - b_0)^2, \quad \text{i.e.,} \quad b_0 = \frac{1}{N} \sum_{t=1}^{N} Y_t \tag{9.2}$$

i.e. the arithmetic mean.

Since the model is used for forecasting purposes with AMA(N), one cannot obviously use future values. Let $Y_t$ and $\hat{Y}_t$ denote observed and model-predicted values of Y at time interval t. Then, the following recursive equation for the average $M_t$ of the N most recent observations is used:

$$\hat{Y}_{t+1} \equiv M_t = (Y_t + Y_{t-1} + \cdots + Y_{t-N+1})/N \tag{9.3a}$$

or

$$M_t = M_{t-1} + (Y_t - Y_{t-N})/N \tag{9.3b}$$



where, for better clarity, the notation $M_t$ is used instead of
$\hat{Y}_{t+1}$ to denote model-predicted forecasts one time
interval ahead from time t. At each period, the oldest
observation is discarded and the newest one added to the set;
hence its name "N-period simple moving average". The choice of
N, though important, is circumstance specific and, more
importantly, largely subjective or ad hoc. Large values of N
produce a smoother moving average, but more points at the
beginning of the series are lost and the data series may wind
up so smooth that incremental changes are lost. How simple
3-point and 5-point AMA schemes capture the overall trend
in the electric utility data is shown in Fig. 9.5, while Fig. 9.6
shows the residuals pattern with time. As expected, one notes
that AMA(3) smoothing has larger residual spikes, i.e., is
not as smooth as AMA(5), but it is quicker to reflect changes
in the series. Which model is "better" cannot be ascertained
since it depends on the intent of the analysis: whether to
capture longer-term trends or shorter ones. A cross-validation
evaluation can be done as illustrated later by Example 9.5.3.

Fig. 9.6 Residual plots illustrating how two different AMA
smoothing methods capture the electric utility load data (data
from Table 9.1)

It is clear that by taking simple moving averages, one
overlooks the stochastic element in the data series. Its use
results in a lag in the predicted values $\hat{Y}_{t+1}$ (or $M_t$).
Further, cyclic or seasonal behavior is not well treated unless
the sliding window length is selected to be a multiple of the
basic period. For the Example 9.1.4 data, a 4-point or an
8-point sliding window would give proper weight to the
forecasts. This lag, as well as higher order trends in the data,
can be corrected by taking higher order MA methods, such
as the moving average of moving averages, called a double
moving average^. Such techniques, as well as more
sophisticated variants of AMA, are available, but the same degree
of forecast accuracy can be obtained by using exponentially
weighted moving average models or the trend and seasonal
models described below.
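
The AMA(N) recursion of Eqs. 9.3a and 9.3b can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `ama_forecast` and the short data series are hypothetical (the series is not the Table 9.1 load data), and only past and current values enter each average, as required for forecasting.

```python
import numpy as np

def ama_forecast(y, N):
    """One-step-ahead AMA(N) forecasts (hypothetical helper).

    m[t] is the mean of the N most recent observations y[t-N+1..t]
    (Eq. 9.3a), i.e. the forecast for time t+1. The first N-1 entries
    are NaN because a full window is not yet available.
    """
    y = np.asarray(y, dtype=float)
    if N < 1 or N > len(y):
        raise ValueError("N must satisfy 1 <= N <= len(y)")
    m = np.full(y.shape, np.nan)
    m[N - 1] = y[:N].mean()
    for t in range(N, len(y)):
        # Eq. 9.3b: discard the oldest observation, add the newest
        m[t] = m[t - 1] + (y[t] - y[t - N]) / N
    return m

# Made-up series used only to exercise the recursion
y = np.array([10.0, 12.0, 11.0, 13.0, 15.0, 14.0, 16.0, 18.0])
m3 = ama_forecast(y, 3)
m5 = ama_forecast(y, 5)

# Forecast residuals: observed Y_{t+1} minus the forecast M_t
resid3 = y[3:] - m3[2:-1]
```

Note that the number of usable forecasts shrinks as N grows, which is the loss of points at the beginning of the series mentioned above.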



9.3.2 Exponentially Weighted Moving Average (EWA)

AMA is useful if the data trend indicates that future values
are simply averages of the past values. The AMA model
is appropriate if one has reason to believe that the one-step
forecast is likely to be equally influenced by the N previous
observations. However, in case recent values influence future
values more strongly than do past values, a weighting scheme
needs to be adopted, and this is the basis of EWA. Such
smoothing models are widely used, their popularity stemming
not only from their simplicity and computational efficiency
but also from their ease of self-adjustment to changes
in the process being forecast. Thus, they provide a means to
adapt to changes in data trends which is superior to AMA.
Modifying Eq. 9.3a results in:



$\hat{Y}_{t+1} \equiv M_t = \alpha Y_t + (1-\alpha) M_{t-1}$
$\qquad = \alpha Y_t + \alpha(1-\alpha) Y_{t-1} + \alpha(1-\alpha)^2 Y_{t-2} + \cdots$    (9.4)

where 0 < $\alpha$ < 1 is the exponential smoothing fraction. If
$\alpha$ = 0.2, then the weights of the previous observations are
0.16, 0.128, 0.1024 and so on. Note that normalization requires
that the weights sum to unity, which holds true since:

$\alpha \sum_{i=0}^{\infty} (1-\alpha)^i = \frac{\alpha}{1-(1-\alpha)} = 1$    (9.5)



^ See for example McClave and Benson (1988) or Montgomery and 
Johnson (1976). 



A major drawback in exponential smoothing is that it is
difficult to select an "optimum" value of $\alpha$ without making
some restrictive assumptions about the behavior of the time
series data. Like many averages, the EWA series changes less
rapidly than the time series itself. Note that the choice of
$\alpha$ is critical, with smaller values giving more weight to the
past values of the time series, resulting in a smooth, slowly
changing series of forecasts. Conversely, choosing $\alpha$ closer
to 1 yields an EWA series closer to the original. Several values
of $\alpha$ should be tried in order to determine how sensitive
the forecast series is to the choice of $\alpha$. In practice, $\alpha$
is chosen such that 0.01 < $\alpha$ < 0.30. Figure 9.7 illustrates how
the two EWA schemes capture the overall trend, and how the
choice of $\alpha$ affects the smoothing of the electric load data
shown in Table 9.1. The residual plots are shown in Fig. 9.8 and,
as expected, residuals for EWA(0.2) are lower in magnitude
than those of EWA(0.5).

Fig. 9.7 Plots illustrating how two different EWA smoothing
methods capture the electric utility load data denoted by
MW(meas) (data from Table 9.1)

Fig. 9.8 Residual plots illus-

Fig. 9.9 Assumed weights for AMA(5) and EWA(0.2) used to generate
Figs. 9.5 and 9.7
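
The EWA recursion of Eq. 9.4 and the weight normalization of Eq. 9.5 can be checked numerically with the short sketch below. The helper names `ewa_forecast` and `ewa_weights` are hypothetical, and starting the recursion with the first forecast equal to the first observation is an assumption made here, since the text does not state how the recursion is initialized.

```python
import numpy as np

def ewa_forecast(y, alpha):
    """One-step-ahead EWA forecasts M_t = alpha*Y_t + (1-alpha)*M_{t-1}.

    Assumption: the recursion is started with M_0 = Y_0.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    y = np.asarray(y, dtype=float)
    m = np.empty_like(y)
    m[0] = y[0]
    for t in range(1, len(y)):
        m[t] = alpha * y[t] + (1.0 - alpha) * m[t - 1]
    return m

def ewa_weights(alpha, k):
    """Weights alpha*(1-alpha)**i placed on Y_t, Y_{t-1}, ..., Y_{t-k+1}."""
    return alpha * (1.0 - alpha) ** np.arange(k)

w = ewa_weights(0.2, 200)      # current observation first, then older ones
weight_sum = w.sum()           # approaches unity as more terms are kept

m = ewa_forecast([10.0, 20.0, 30.0], 0.5)
```

The first entry of `w` is the weight on the current observation ($\alpha$ itself); the entries that follow are the weights 0.16, 0.128, 0.1024, ... on the previous observations quoted above for $\alpha$ = 0.2.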



Exponential smoothing can be used to estimate the
coefficients in polynomial models of any degree. Though this
approach could be used for higher order models, it gets
increasingly complex, and it is simpler and more convenient to
use trend and seasonal models as well as stochastic models.
Note that both the AMA and EWA models are procedures
that adjust the future predicted value by an amount that is
proportional to the most recent forecast error (see Fig. 9.9).
This is the reason why forecasts based on these smoothing
models fall under the class sometimes referred to as adaptive
forecasting methods.

Filtering is a generic term used to denote an operation
where time series data is modified in a pre-determined manner
so as to produce an output with emphasis on variation
at particular frequencies. For example, a low-pass filter is
used to remove local fluctuations made up of high frequency
variations. Thus, AMA and EWA can be viewed as different
types of low-pass filters.



9.4 OLS Regression Models

9.4.1 Trend Modeling 

Pure regression models can often be used for forecasting.
OLS assumes that the model residuals or errors are independent
random variables, implying that successive observations
are also considered to be independent. This is not strictly
true in time series data, but assuming it to be so leads to
a class of models which are often referred to as trend and
seasonal time series models. In such a case, the time series
model is just a sophisticated method of extrapolation. For
example, the simplest extrapolation model is given by the
linear trend:

$Y_t = b_0 + b_1 t$    (9.6)

where t is time, and $Y_t$ is the value of Y at time t. This is akin
to a simple linear regression model with time as a regressor.
The interpretation of the coefficients $b_0$ and $b_1$ is identical
for both classical regression and trend and seasonal time
series analysis. What differentiates the two approaches is that,
while OLS models are used to predict future movements in
a variable by relating it to a set of other variables in a causal
framework, the "pure" time series models are used to predict
future movements using "time" as a surrogate variable. Model
parameters are sensitive to outliers, and to the first and
last observations of the time series.

To model a series whose rate of growth is proportional
to its current value, the exponential growth model is more
appropriate:

$Y_t = b_0 \exp(b_1 t)$    (9.7a)

which can be linearized by taking logs:

$\ln Y_t = \ln b_0 + b_1 t$    (9.7b)

The model coefficients $b_0$ and $b_1$ can then be identified by
least squares regression.
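
As a minimal sketch of this log-linearization, the snippet below generates a synthetic exponential growth series and recovers the coefficients by a straight-line fit to ln Y against t. The "true" coefficients, the multiplicative noise level and the series length are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic series; coefficient values are assumed, not taken from the text
t = np.arange(1, 41, dtype=float)
b0_true, b1_true = 50.0, 0.03
y = b0_true * np.exp(b1_true * t) * np.exp(rng.normal(0.0, 0.02, t.size))

# Linear fit of ln(Y_t) = ln(b0) + b1 * t.
# np.polyfit returns the highest-degree coefficient first: [slope, intercept]
b1_hat, ln_b0_hat = np.polyfit(t, np.log(y), 1)
b0_hat = np.exp(ln_b0_hat)
```

Because the fit is performed on ln Y, the least squares criterion is applied to relative rather than absolute errors, which is worth keeping in mind when the residuals are later examined on the original scale.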

Example 9.4.1: Modeling peak electric demand by a linear
trend model

The electric peak load data consisting of 48 data points given
in Table 9.1 can be regressed following a linear trend model
(given by Eq. 9.6). The corresponding OLS model is:

$Y_t = 77.906 + 1.624\,t$

with $R^2$ = 0.783, RMSE = 12.10 and CV = 10.3%.

Figure 9.10 depicts the model residuals, from which one
notes that the residuals are patterned and, more importantly,
that there is a clear quadratic trend in the residuals as
indicated by the trend line drawn. This suggests that the trend
is not linear and that alternative functional forms should be
investigated. ■



9.4.2 Trend and Seasonal Models 

In order to capture the deterministic seasonal trends in the
data, a general additive time series model formulation
analogous to a classical OLS regression model can be
heuristically expressed as:

$Y_t = b_0 + b_1 f_1 + b_2 f_2 + b_3 f_3 + \cdots + b_n t + \varepsilon$    (9.8)



where t is the time index, f is the cyclic or seasonal frequency
and the "b"s are the model coefficients. Note that the residual
effect $\varepsilon$ should ideally consist of a white noise element
and a structured residual pattern (which is extremely difficult
to remove). Modeling the structured portion of this residual
pattern is the objective of stochastic modeling of time series
data and is considered in the next section.



Fig. 9.10 Figure illustrating that residuals for the linear trend
model (Eq. 9.6) are not random (see Example 9.4.1). They
exhibit both local systematic scatter as well as an overall
pattern as shown by the quadratic trend line. They seem to
exhibit larger scatter than the AMA residuals shown in Fig. 9.6

Fig. 9.11 Residuals for the linear and seasonal model (see
Example 9.4.2). Note that the residuals still exhibit a pattern

Many time series data exhibit cyclic behavior, and one
can use the indicator variable modeling approach (described
in Sects. 5.7.2 and 5.7.3) in such cases. Consider the demand
data for the electric utility shown in Table 9.1 and Fig. 9.4
which has seasonal differences. Since quarterly data are
available, the following model can be assumed to capture
such seasonal behavior:



$Y_t = b_0 + b_1 I_1 + b_2 I_2 + b_3 I_3 + b_4 t$    (9.9)

where t = time ranging from t = 1 (first quarter of 1974) to
t = 48 (last quarter of 1985), and

$I_1$ = 1 for quarter = 1
      = 0 for quarter = 2, 3, 4

$I_2$ = 1 for quarter = 2
      = 0 for quarter = 1, 3, 4

$I_3$ = 1 for quarter = 3
      = 0 for quarter = 1, 2, 4.

When the time series is changing at an increasing rate over
time, the multiplicative model is more appropriate:

$Y_t = \exp(b_0 + b_1 I_1 + b_2 I_2 + b_3 I_3 + b_4 t + \varepsilon)$    (9.10a)

or

$\ln Y_t = b_0 + b_1 I_1 + b_2 I_2 + b_3 I_3 + b_4 t + \varepsilon$    (9.10b)

Example 9.4.2: Modeling peak electric demand by a linear
trend plus seasonal model

The linear trend plus seasonal model given by Eq. 9.9 has
been fit to the electric utility load data, yielding the
following model:

$Y_t = 70.5085 + 13.6586\,I_1 - 3.7359\,I_2 + 18.4695\,I_3 + 1.6362\,t$

with the following statistics: $R^2$ = 0.914, RMSE = 7.86 and
CV = 6.68%.

Thus, the model $R^2$ has clearly improved from 0.783 for
the linear trend model to 0.914 for the linear-seasonal model,
while the RMSE has decreased from 12.10 to 7.86. However,
as shown in Fig. 9.11, the residuals are still not entirely
random (the residuals of the data series at either end are lower),
and one would investigate other models. ■
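
The indicator-variable formulation of Eq. 9.9 translates directly into an OLS design matrix. The sketch below uses synthetic quarterly data rather than the Table 9.1 values; the generating coefficients and the noise level are assumptions, and the RMSE is computed with n - p degrees of freedom, which is also an assumption about the convention intended.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 48
t = np.arange(1, n + 1)
quarter = (t - 1) % 4 + 1                 # 1, 2, 3, 4, 1, 2, ...

# Indicator variables for quarters 1 to 3; quarter 4 is the reference level
I1, I2, I3 = [(quarter == q).astype(float) for q in (1, 2, 3)]

# Design matrix for Y_t = b0 + b1*I1 + b2*I2 + b3*I3 + b4*t
X = np.column_stack([np.ones(n), I1, I2, I3, t])

# Synthetic observations (assumed coefficients and noise level)
beta_true = np.array([70.0, 14.0, -4.0, 18.0, 1.6])
y = X @ beta_true + rng.normal(0.0, 5.0, n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
p = X.shape[1]
rmse = np.sqrt(resid @ resid / (n - p))
r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
```

Only three indicators are needed for four quarters: including a fourth together with the intercept would make the columns of the design matrix linearly dependent.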

Unlike the case of smoothing methods (such as the AMA
and EWA), a model is now being used, and this allows
standard model errors to be computed. Thus, one is able to
estimate the errors associated with predicting the individual
as well as the mean value of future values at a forecasting
period of one time step under a moving horizon basis. Since
this approach does not attempt to deal with the systematic
stochastic component of the residuals, the prediction
uncertainty bands are similar to those of the regression models.
However, instead of using the complete expression for
individual prediction bands given in Sect. 5.3.4 (specifically,
Eqs. 5.15 and 5.16), the term associated with the slope
parameter uncertainty is usually dropped (note that one will be
underpredicting the uncertainty a little). This would result in
the following simplified expressions which can be assumed
constant irrespective of the number m of future time steps of
predictions:
(a) individual predictions:

Uncorrelated residuals: $\mathrm{var}(\hat{Y}_{t+m}) = \sigma_\varepsilon^2 \left[ 1 + \frac{1}{n} \right]$    (9.11)

(b) mean values of, say, m future values:

Uncorrelated residuals: $\mathrm{var}(\langle \hat{Y}_{t+m} \rangle) = \frac{\sigma_\varepsilon^2}{m} \left[ 1 + \frac{m}{n} \right]$    (9.12)

where n is the number of observations used to identify the model.

Example 9.4.3: Comparison of different models for peak
electric demand

Let us use the time series data of the electric utility demand
to compare the internal predictive accuracy of the various
models. Since the model identification was done using
OLS, there are no bias errors to within rounding errors of
the computer program used. Hence, it is logical to base this
evaluation on the RMSE statistics, which are given in
Table 9.2. AMA(3) is surprisingly the best in that it has the
lowest RMSE of all models, followed very closely by the
(linear + seasonal) model. The simple linear model has the
highest residual error. This illustrates that a blind OLS fit
to the data is not recommended. However, the internal
prediction accuracy is of limited use per se, our intention being
to use the model to forecast future values (to be illustrated in
Example 9.5.2 below). ■

Table 9.2 Accuracy of the various modeling approaches when applied
to the electric utility load data given in Table 9.1 (48 data points covering
1974-1985). The RMSE correspond to internal prediction accuracy

        AMA(3)   AMA(5)   EWA(0.2)   EWA(0.5)   Linear   Linear + Seasonal
RMSE    7.68     9.02     8.59       11.53      12.10    7.86

9.4.3 Fourier Series Models for Periodic Behavior



It is clear that both the smoothing models (AMA and EWA)
operate directly on the observation values {$Y_t$} unlike the
stochastic time series models discussed in the next section.
The trend and seasonal models using the OLS approach are
analogous in that they model the structural component of the
data. Another OLS modeling approach which achieves the
same objective is to use basic Fourier series models.

Note that the word "seasonal" used to describe time series
data really implies "periodic". Thus, the Fourier series
modeling approach applies to data which exhibit distinct
periodic behavior which is known beforehand from the nature of
the system, or which can be gleaned from plotting the data. For
example, the data in Fig. 9.1 exhibit strong weekly cycles
while those in Fig. 9.12 have strong diurnal cycles. Recall that
a periodic function is one which can be expressed as:

$f(t) = f(t + T)$    (9.13)

where T is a constant called the period and is related to the
frequency (or cycles/time) f = 1/T, and to the angular frequency
$\omega = 2\pi/T$. This applies to any waveform such as sinusoidal,
square, saw-tooth, etc. For example, in Fig. 9.1, the period is
1 week while the frequency would be 52 year$^{-1}$. A special
case is the simple sinusoid function expressed as:

$y(t) = a_0 + a_1 \cos(\omega t) + a_2 \sin(\omega t)$    (9.14)

where $a_0$, $a_1$ and $a_2$ are model parameters to be determined
by OLS regression to data. Note that if the frequency $\omega$ is
known in advance, then one sets $x_1 = \cos(\omega t)$ and
$x_2 = \sin(\omega t)$, which allows reducing Eq. 9.14 to a linear
model:

$y(t) = a_0 + a_1 x_1 + a_2 x_2$    (9.15)

Fig. 9.12 Measured hourly whole building electric use
(excluding cooling and heating related energy) for a large
university building in central Texas (from Dhar et al. 1999) from
January to June. The data shows distinct diurnal and weekly
periodicities but no seasonal trend. Such behavior is referred
to as weather-independent data. The residual data series using
a pure sinusoidal model (Eq. 9.16) are also shown. a January-June,
b April

Fig. 9.13 Measured hourly whole building cooling thermal
energy use for the same building as in Fig. 9.12 (from Dhar et al.
1999) from January to June. The data shows distinct diurnal and
weekly periodicities as well as weather-dependency. The
residual data series using a sinusoidal model with weather
variables (Eq. 9.18) are also shown
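
The reduction of the sinusoid of Eq. 9.14 to the linear model of Eq. 9.15 can be sketched as follows for a known period. The 24-hour period, the coefficient values, the noise level and the two-week record length are all assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

T = 24.0                        # assumed known period (hours)
omega = 2.0 * np.pi / T         # angular frequency
t = np.arange(0, 24 * 14, dtype=float)   # two weeks of hourly time stamps

# Synthetic observations with assumed coefficients a0, a1, a2
a_true = np.array([100.0, -20.0, 8.0])
y = (a_true[0]
     + a_true[1] * np.cos(omega * t)
     + a_true[2] * np.sin(omega * t)
     + rng.normal(0.0, 2.0, t.size))

# Regressors x1 = cos(omega*t) and x2 = sin(omega*t) make the model linear
X = np.column_stack([np.ones_like(t), np.cos(omega * t), np.sin(omega * t)])
a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Amplitude of the fitted sinusoid
amplitude = np.hypot(a_hat[1], a_hat[2])
```

Additional harmonics can be handled in the same way by appending further cosine and sine columns at integer multiples of $\omega$.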



The extension of this model to more than one frequency
(denoted by subscript j) is:

$y(t) = a_0 + \sum_{j=1}^{n} [a_j \cos(j \omega t) + b_j \sin(j \omega t)]$    (9.16)

The above formulation is general, and can be modified as
dictated by the specific situation. One such instance is the
Fourier series model meant to describe hourly energy use in
commercial buildings, as illustrated by the following example
taken from Dhar et al. (1999). Whole building hourly
energy use E for a large university building in central Texas
is shown in Fig. 9.12a for six months (from January to
June) and in Fig. 9.12b for the month of April only in order
to better illustrate the periodicities in the data. The data
channel includes building internal electric loads (lights and
equipment) and electricity to operate the air-handlers but
does not include any cooling or heating energy use. Hence,
the data shows no long-term trend but distinct diurnal
periodicities (occupied vs unoccupied hours) as well as weekly
periodicities (weekday vs weekends). There are also abrupt
drops in usage during certain times of year reflective of
events such as semester breaks. On the other hand, Fig. 9.13
is a time series plot of cooling thermal energy use for the
same building. This energy use exhibits a clear seasonal trend
since more cooling is required as the weather warms up from
January till June. Such loads are referred to as weather
dependent loads by building energy professionals. Residual
effects, i.e., differences between measured data and
predictions by a Fourier series model identified by OLS, are also
shown in both figures to show the accuracy with which such
models capture actual measured behavior. It must be noted
that the time series data have been separated into three
day-types: weekdays, weekends and holidays/semester break
periods, and separate models have been fit to each of these
periods. How to identify such periods statistically falls under
the purview of classification, an issue which was addressed
in Chap. 8.

A general model formulation which can capture the
diurnal and seasonal periodicities as well as their interaction
such as varying amplitude (see Fig. 9.14) is as follows:

$E_{d,h} = X(d) + Y(h) + Z(d,h) + \varepsilon_{d,h}$    (9.17)

where

$X(d) = \sum_{i=0}^{i_{max}} \left[ \gamma_i \sin\frac{2\pi}{P_i} d + \delta_i \cos\frac{2\pi}{P_i} d \right]$

$Y(h) = \sum_{j=0}^{j_{max}} \left[ \mu_j \sin\frac{2\pi}{P_j} h + \rho_j \cos\frac{2\pi}{P_j} h \right]$

$Z(d,h) = \sum_{i=0}^{i_{max}} \sum_{j=0}^{j_{max}} \left[ \phi_i \sin\frac{2\pi}{P_i} d + \psi_i \cos\frac{2\pi}{P_i} d \right] \left[ \eta_j \sin\frac{2\pi}{P_j} h + \zeta_j \cos\frac{2\pi}{P_j} h \right]$

where h and d denote hourly and daily indices respectively, and
$P_i$ = (365/i) and $P_j$ = (24/j) for the annual and daily cycles
respectively. Also, there are 24 hourly observations
in a daily cycle and 365 days in a year, which means that
$i_{max}$ = (365/2 - 1) = 181 and $j_{max}$ = (24/2 - 1) = 11. These
restrictions would avoid the number of parameters in the model
exceeding the number of observations. Obviously, one would
not use so many frequencies since far fewer frequencies
should provide adequate fits to the data.
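
To make the structure of Eq. 9.17 concrete, the sketch below fits a single diurnal harmonic, a single annual harmonic and their products to a noise-free synthetic profile with varying mean and amplitude. The profile E = Y + X + Y·X, the periods of 24 h and 365 days, and the restriction to one harmonic each are assumptions made for illustration; they are not the models fitted by Dhar et al.

```python
import numpy as np

# One year of hourly time stamps: day index d and hour-of-day index h
d = np.repeat(np.arange(365), 24).astype(float)
h = np.tile(np.arange(24), 365).astype(float)

# Synthetic noise-free profile with varying mean and amplitude (assumed form)
Y = 1.0 - np.cos(2.0 * np.pi * h / 24.0)     # diurnal term
X = 1.0 - np.cos(2.0 * np.pi * d / 365.0)    # annual (seasonal) term
E = Y + X + Y * X                            # interaction makes amplitude vary

# First-harmonic regressors for the annual and daily cycles
sd, cd = np.sin(2.0 * np.pi * d / 365.0), np.cos(2.0 * np.pi * d / 365.0)
sh, ch = np.sin(2.0 * np.pi * h / 24.0), np.cos(2.0 * np.pi * h / 24.0)

# Columns: constant, X(d) terms, Y(h) terms, then the Z(d,h) product terms
A = np.column_stack([np.ones_like(d), sd, cd, sh, ch,
                     sd * sh, sd * ch, cd * sh, cd * ch])
coef, *_ = np.linalg.lstsq(A, E, rcond=None)
```

Expanding the synthetic profile gives E = 3 - 2cos(2πh/24) - 2cos(2πd/365) + cos(2πd/365)·cos(2πh/24), so the fit should return non-zero coefficients only for the constant, the two cosine columns and the cosine-cosine product column.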

Note that Y and X represent diurnal and seasonal
periodicities respectively, while Z accounts for their interaction.
In other words, Y alone will represent a load shape of constant
mean and amplitude (shown in Fig. 9.14a). When term X is
added to the model, variation in the mean energy use can be
modeled (see Fig. 9.14b). Addition of the term Z enables the
model to represent shapes with varying mean and amplitude
(Fig. 9.14c).

Fig. 9.14 Three general types of periodic profiles applicable to model
the pronounced diurnal and annual behavior of energy use in buildings
(from Dhar et al. 1999). The energy use of a given hour h and day d has
been modeled following functional forms given by Eq. 9.17. a Constant
mean and amplitude, b varying mean but constant amplitude, c varying
mean and amplitude. Specifically:
(a) $E_{h,d} = Y(h) = 1 - \cos((2\pi/24)h)$
(b) $E_{h,d} = Y(h) + X(d) = [1 - \cos((2\pi/24)h)] + [1 - \cos((2\pi/24)d)]$
(c) $E_{h,d} = Y(h) + X(d) + Y(h) \cdot X(d)$

Equation 9.17 is the general "seasonal" model
for capturing effects which are related to time (for example,
time of day or time of year). In the case of building energy
use, these reflect the manner in which the building is
scheduled for operation. Energy control systems automatically
switch on and off certain equipment consistent with how the
building is occupied. However, there are other building loads
(such as cooling energy use shown in Fig. 9.13) which
are affected by other variables. Typical variables are outdoor
dry-bulb temperature ($T_{db}$), absolute humidity of outdoor air
(W) and solar radiation (S). Humidity affects the latent
cooling loads in a building only at higher humidity levels.
Previous studies (supported by theoretical considerations) have
shown that this effect is linear, not with W but with the humidity
potential $W^* = (W - 0.0092)^+$, indicating that the absolute
humidity effect is zero when W < 0.0092 and linear above that
threshold value. In such a case, the most general formulation
of the equivalent "linear and seasonal trend" model is:



$E_{d,h} = \sum_{k} k \cdot [X_k + Y_k + Z_k]$    (9.18)

where k = 1, $T_{db}$, $W^*$ and S.

The choice of which terms to retain in the model is made
based on statistical tests or by stepwise regression (discussed
in Sect. 5.7.4). Usually only a few terms are adequate.
Table 9.3 assembles the results of fitting a Fourier series
model to the whole building electric hourly data for a whole
year shown in Fig. 9.12 and to the cooling thermal energy
use for the first six months of the year (shown in Fig. 9.13).
The progressive improvement of the model $R^2$ as stepwise
regression is performed is indicative of the added contribution
of the associated variable in describing the variation in
the dependent variable. The weather independent model is
very good with high $R^2$ and low CV. In the case of
temperature-dependent loads, the $R^2$ is very good but the CV is
rather poor (12%) with the residual values being about ±1 GJ/h
(from Fig. 9.13). The ambient temperature variable is by far
the most influential variable.

The above case study is intended to demonstrate how the
general Fourier series modeling framework can be modified
or tailored to capture structural trend in time series data with
distinct and clear periodicities. For other types of
applications, one would expect different variables as well as
different periodicities to be influential, and the model structure
would have to be suitably altered. Spectral analysis is an
extension of Fourier analysis as applied to stochastic processes
with no obvious trend or seasonality; if there are obvious trends,
these should be removed prior to spectral analysis. The
spectrum of a data series is a plot of the amplitude versus the
angular frequency. Inspection of the spectrum allows one to
detect hidden periodicities or features that show statistical
regularity (see for example, Bloomfield 1976).

Table 9.3 Fourier series model results for building loads during the
school-in-session weekdays (from Dhar et al. 1998). CHi (and SHi) and
CDi (and SDi) represent the ith frequency of the cosine (and sine) terms
of the diurnal and seasonal cycles respectively. The climatic variables
are ambient dry-bulb temperature (T_db), absolute humidity potential
(W*) and horizontal solar radiation (S)

                   Weather independent (one year), Fig. 9.12         Weather dependent (6 months), Fig. 9.13
No. of parameters  Variable  Partial R^2  Cumulative R^2  CV (%)     Variable   Partial R^2  Cumulative R^2  CV (%)
2                  CH1       0.609        0.609           -          T_db       0.804        0.804           -
3                  SH1       0.267        0.876           -          W*         0.062        0.866           -
4                  CH2       0.041        0.918           -          T_db*CH1   0.179        0.888           -
5                  SH4       0.012        0.927           -          S*CH1      0.008        0.892           12.08
6                  SH3       0.007        0.937           -
7                  SD1       0.006        0.943           3.8

9.4.4 Interrupted Time Series

9.4.4 Interrupted Time Series 

The time series data considered above was representative of
a system which behaved predictably, albeit with some noise
or unexplained variation, but without any abrupt changes in
the system dynamics. Since the systems of interest are
often dynamic entities, their behavior changes in time due
to some specific cause (or intervention), and this gives rise
to interrupted time series. Often, such changes can be
detected fairly easily, and should be removed as part of the trend
and seasonal modeling phase. In many cases, this can be
done via OLS modeling or some simple transformation; two
simple cases will be briefly discussed below. More complex
interventions cannot be removed by such simple measures
and should be treated in the framework of transfer functions
(Sect. 9.6).

9.4.4.1 Abrupt One-Time Constant Change in Time

This is the simplest type of intervention, arising when the
time series data abruptly changes its mean value and
assumes a new mean level. If the time at which the intervention
took place is known, one can simply recast the trend and
seasonal model to include an indicator variable (see Sect. 5.7.3)
such that I = 0 before the intervention and I = 1 afterwards (or
vice versa).

The case of an abrupt operational change in the heating,
ventilation and air-conditioning (HVAC) system of a
commercial building is illustrated by the following example
(Ruch et al. 1999). Synthetic energy data ($E_t$) for day t was
generated using 91 days of real daily outdoor dry-bulb
temperature data $T_t$ from central Texas according to the
following model:

$E_t = 2{,}000 + 100\,T_t + 1{,}500\,I + \varepsilon_t$    (9.19)

where the indicator variable

I = 1 for days t = 1, 2, ..., 61, and
I = 0 for days t = 62, 63, ..., 91

and $\varepsilon_t$ is the error term assumed normally distributed.



This hypothetical building has been assumed to undergo
an energy saving operational change on the 62nd day, the
result being a 1,500 unit shift (or decrease) in energy use at
the change point. In this case, since the shift is in the mean
value only, the slope of the model is constant over the
entire period. A superficial glance at a scatter plot of the data
(Fig. 9.15a) suggests a single linear model, though a closer
look at the residuals against time would have revealed a shift.
A simple OLS fit gave a reasonable fit ($R^2$ = 0.84) but the
residual autocorrelation was significant ($\rho$ = 0.80). The model
with indicator variables to account for the change-point in
time behavior resulted in a much better fit of $R^2$ = 0.98 and
negligible autocorrelation (see Fig. 9.15b).
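
The effect of the indicator variable in Eq. 9.19 can be reproduced qualitatively with synthetic data. In the sketch below the temperature series and the noise level are assumptions (the original study used measured temperatures from central Texas, which are not reproduced here), and `ols` is a hypothetical helper, so the resulting statistics will differ from the values quoted above.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 91
t = np.arange(1, n + 1)

# Hypothetical daily dry-bulb temperatures (not the measured Texas data)
T = 20.0 + 8.0 * np.sin(2.0 * np.pi * t / 91.0) + rng.normal(0.0, 2.0, n)

# Indicator: I = 1 for days 1..61, I = 0 for days 62..91 (as in Eq. 9.19)
I = (t <= 61).astype(float)
E = 2000.0 + 100.0 * T + 1500.0 * I + rng.normal(0.0, 150.0, n)

def ols(X, y):
    """Return coefficients, R^2 and lag-1 autocorrelation of the residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    rho = np.sum(resid[:-1] * resid[1:]) / np.sum(resid ** 2)
    return beta, r2, rho

# Single linear model that ignores the change point
b_simple, r2_simple, rho_simple = ols(np.column_stack([np.ones(n), T]), E)

# Model with the indicator variable added to the intercept
b_ind, r2_ind, rho_ind = ols(np.column_stack([np.ones(n), T, I]), E)
```

Comparing `rho_simple` with `rho_ind` shows how the unmodeled shift appears as serial correlation in the residuals of the single linear model and largely disappears once the indicator variable is included.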

9.4.4.2 Gradual and Constant Change over Time 

Another type of common intervention is when the change is
not abrupt but gradual and constant. In the framework of the
example given above, the energy use in the building "creeps"
up in time due to such causes as increase of equipment (more
computers, printers, ...) or gradual degradation in
performance of the HVAC system. Such energy creep is widely
observed in buildings and has been well-documented in several
publications. Again, one has to remove the structural portion
of the time series so as to de-trend it. A simple approach,
when the degradation is related to temperature, is to modify
the model given by Eq. 9.19 as follows:

$E_t = a + b\,T_t + c\,I\,T_t + \varepsilon_t$    (9.20)

where the indicator variable is now applied to the slope of the
model, and not to the intercept, such that:
I = 0 before the onset of energy creep, and
I = 1 after the onset.

Interrupted time series can arise due to various types of
interventions. This section presented some simple cases
when the interventions were one-time occurrences and the
time of their onset was known. There are more complex
types of interventions which are due to known forcing
functions which change in time, and such cases are treated using
transfer function models (Sect. 9.6). The reader can also
refer to McCain and McCleary (1979) for more discussion on
interrupted time series analysis.

Fig. 9.15 Improvement in OLS model fit when an indicator
variable is introduced to capture abrupt one-time change in energy
use in a building. a Ordinary least square (OLS) model, b
Indicator variable model (IND). (From Ruch et al. 1999)



9.5 Stochastic Time Series Models 

9.5.1 Introduction 

The strength of the classical regression techniques (such
as OLS discussed in Chap. 5) is the ability to capture the
deterministic or structural trend of the data, while the
model residuals are treated as random errors which impact the
uncertainty limits of future predictions. In time series data,
the residuals are often patterned, i.e., have a serial
correlation which is difficult to remove using classical regression
methods. The strength of the time series techniques is their
ability to first detect whether the model residuals are purely
random or not, i.e., whether they are white noise. If they are,
classical regression methods are adequate. If not, the residual
errors are separated into a systematic stochastic component
and white noise. The former is treated by stochastic time
series models such as AR, MA, ARMA, ARIMA and ARMAX
(these terms will be explained below) which are linear in
both model and parameters, and hence, simplify the
parameter estimation process. Such an approach usually allows
more accurate predictions than classical regression (i.e.,
narrower uncertainty bands around the model predictions), and
therein lies their appeal. Once it is deemed that a time series
modeling approach is appropriate for the situation at hand,
three separate issues are involved, similar to OLS modeling:
(i) identification of the order of the model (i.e., model
structure), (ii) estimation of the model parameters (parameter
estimation), and (iii) ascertaining uncertainty in the forecasts.
It must be noted that time series models may not always be
superior to the standard OLS methods even when dealing
with time series data; hence, the analyst should evaluate
various models in terms of their predictive ability by performing
a sample holdout cross-validation evaluation (described in
Sect. 5.3.2d).

There is a rich literature on stochastic time series models 
as pertinent to various disciplines, and this section should 
be regarded as an introduction to this area (see texts such as 
Box and Jenkins 1976 dealing with engineering applications 
and Pindyck and Rubinfeld 1981 dealing with econometric 
applications). Some professionals view stochastic time series analysis as being too mathematically involved to be of much use for practical applications; this is no more true than it is for any other statistical method. An understanding of the basic principles and of the functionality of the different forms of ARMAX models, a willingness to learn by way of practice, and familiarity with an appropriate statistical software package are all that is needed to add time series analysis to one's toolbox. 



9.5.2 ACF, PACF and Data Detrending 

9.5.2.1 Autocorrelation Function (ACF) 

The concept of the ordinary correlation coefficient between two variables has already been introduced in Sect. 3.4.2. This concept can be extended to time series data to ascertain if successive observations are correlated. Let $\{Y_1, \ldots, Y_n\}$ be a discrete detrended time series from which the long-term trend and seasonal variation have been removed. Then $(n-1)$ pairs of observations can be formed, namely $(Y_1, Y_2), (Y_2, Y_3), \ldots, (Y_{n-1}, Y_n)$. An autocorrelation (or serial correlation) coefficient $r_1$ measures the extent to which successive observations are interdependent and is given by: 



$$r_1 = \frac{\sum_{t=1}^{n-1} (Y_t - \bar{Y})(Y_{t+1} - \bar{Y})}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2} \tag{9.21}$$

where $\bar{Y}$ is the overall mean. 

If the data is cyclic, this behavior will not be captured 
by the first order autocorrelation coefficient. One needs to 
introduce, by extension, the serial correlation at lag k, i.e., 
between observations k apart: 



$$r_k = \frac{\dfrac{1}{n-k}\sum_{t=1}^{n-k} (Y_t - \bar{Y})(Y_{t+k} - \bar{Y})}{\dfrac{1}{n-1}\sum_{t=1}^{n} (Y_t - \bar{Y})^2} = \frac{c_k}{c_0} \tag{9.22}$$



where n is the number of data points and $c_k$ and $c_0$ are the autocovariance coefficients at lag k and lag zero respectively. Though it can be calculated for lags of any size, it is usually inadvisable to calculate $r_k$ for values of k greater than about (n/4) (Chatfield 1989). A value of $r_k$ close to zero would imply little or no relationship between observations k lags apart. The autocorrelation function (ACF) represents the variation of $r_k$ with lag k. Usually, there is no need to fit a functional equation, but a graphical representation called the correlogram is a useful means both to provide insights into model development and to evaluate whether stationarity (i.e., detrending by removal of long-term trend and periodic/cyclic variation) has been achieved or not. The ACF can be viewed as an extension of the Durbin-Watson (DW) statistic presented in Sect. 5.6.1, which relates to one lag only, i.e., k = 1. 

Figure 9.16 illustrates how the ACF varies for four different numerical values of $r_1$ (both positive and negative) assuming only first-order autocorrelation to be present. One notes that all curves are asymptotic, dropping to zero faster for the weaker correlations, while plots with negative correlation fluctuate on either side of zero as they drop towards zero. The close similarity between these plots and those of the Pearson correlation coefficient (Sect. 3.4.2 and Fig. 3.13) should be noted. Because $r_k$ is normalized with the value at lag k = 0, the ACF at lag 0 is unity, i.e., $r_0 = 1$. If the data series were non-stationary, these plots would not asymptote towards zero (as illustrated in Fig. 9.17). Thus, stationarity is easily verified via the ACF. 
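The lag-k autocorrelation can be sketched numerically as follows. This is a minimal illustration assuming NumPy is available; the function name `sample_acf` is a hypothetical choice and not part of the original text. The sketch uses the form of Eq. 9.21 extended to lag k, i.e., it omits the 1/(n-k) and 1/(n-1) scale factors of Eq. 9.22 so that the value at lag zero is exactly one; for k much smaller than n the two forms differ only slightly.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations r_0..r_max_lag of a (detrended) series y.

    Uses r_k = sum_{t=1}^{n-k} (Y_t - Ybar)(Y_{t+k} - Ybar) / sum_{t=1}^{n} (Y_t - Ybar)^2,
    i.e. Eq. 9.21 generalized to lag k (scale factors of Eq. 9.22 omitted).
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    d = y - y.mean()                 # deviations from the overall mean
    denom = np.sum(d * d)
    r = np.empty(max_lag + 1)
    for k in range(max_lag + 1):
        r[k] = np.sum(d[: n - k] * d[k:]) / denom
    return r

# Small hand-checkable case: deviations are [-2, -1, 0, 1, 2]
r = sample_acf([1, 2, 3, 4, 5], max_lag=2)   # r[0] = 1.0, r[1] = 0.4, r[2] = -0.1
```

Plotting `r` against the lag index gives the correlogram discussed above.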

The standard error for $r_k$ is calculated under the assumption that the autocorrelations have died out by lag k using: 

$$\sigma(r_k) = \left[\frac{1}{n}\left(1 + 2\sum_{i=1}^{k-1} r_i^2\right)\right]^{1/2} \tag{9.23}$$



where n is the number of observations. The corresponding confidence intervals at the selected significance level $\alpha$ are given by: $\pm z_{\alpha/2} \cdot \sigma(r_k)$. Any sample correlation which falls outside these limits is deemed to be statistically significant at the selected significance level. The primary value of the correlogram is that it is a convenient graphical way of determining the number of lags after which correlation coefficients are insignificant. This is useful in identifying the appropriate MA model (discussed in Sect. 9.5.3). 
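As a hedged numerical check of the standard error of Eq. 9.23, the sketch below applies its usual large-sample form, $\sigma(r_k) = [(1 + 2\sum_{i<k} r_i^2)/n]^{1/2}$, to the autocorrelations listed in Table 9.4 with n = 48 quarterly observations. The function name `acf_standard_errors` is hypothetical, and the normal quantile 1.959964 is an assumption about how the software computed its 95% limits.

```python
import numpy as np

# Autocorrelations r_1..r_8 copied from Table 9.4; n = 48 observations
r = np.array([0.679357, 0.829781, 0.571588, 0.737873,
              0.462624, 0.600009, 0.358554, 0.482720])
n = 48

def acf_standard_errors(r, n):
    """sigma(r_k) = sqrt((1 + 2 * sum_{i<k} r_i^2) / n) for k = 1..len(r)."""
    r = np.asarray(r, dtype=float)
    # cumulative sum of r_i^2 over lags strictly below k (zero for k = 1)
    cum = np.concatenate(([0.0], np.cumsum(r[:-1] ** 2)))
    return np.sqrt((1.0 + 2.0 * cum) / n)

se = acf_standard_errors(r, n)
z = 1.959964                      # two-sided 95% standard normal quantile (assumed)
limits = z * se                   # half-width of the probability band around zero
significant = np.abs(r) > limits  # True where the coefficient falls outside the band
```

Under these assumptions the computed standard errors agree with the tabulated ones to the printed precision, and only lags 1 to 4 fall outside their limits.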

9.5.2.2 Partial Autocorrelation Function (PACF) 

The limitation of the ACF as a statistical measure suggestive of the order of the model meant to fit the detrended data is that the magnitude of $r_1$ carries over to subsequent $r_k$ values. In other words, the autocorrelation at small lags carries over to larger lags. Consider a first-order model with $r_1 = 0.8$ (see Fig. 9.16c). One would expect the ACF to decay exponentially. Thus, $r_2 = 0.8^2 = 0.64$ and $r_3 = 0.8^3 = 0.512$ and so on, even 
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Fig. 9.16 Correlograms of the 
first-order ACF for different 
magnitudes of the correlation 
coefficient, a Weak positive. 
b Moderate positive, c Strong 
positive, d Strong negative 
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Fig. 9.17 Sample correlogram for a time series which is non-stationary 
since the ACF does not seem to asymptote to zero 

if no second order lags are present. When second order effects are present, the decay is also close to asymptotic, which clouds the identification of the order of the model describing the time series process. A statistical quantity by which one can measure the excess correlation that remains at lag k after a model of order (k-1) is fit to the data is called the partial autocorrelation function (PACF) $\phi_{kk}$. Thus, the ACF of a time series tends to taper off as lag k increases, while the PACF cuts off after a certain lag is reached. 



The standard error for $\phi_{kk}$ is given by: 

$$\sigma(\phi_{kk}) = \left[\frac{1}{n}\right]^{1/2} \tag{9.24}$$

where n is the number of observations, while the corresponding confidence intervals at significance level $\alpha$ are given by $0 \pm z_{\alpha/2} \cdot \sigma(\phi_{kk})$. 

The PACF is particularly useful as an aid in determin- 
ing the order of the AR model as discussed in Sect. 9.5.3. It 
finds application in problems where it is necessary to deter- 
mine the order of the ODE needed to model dynamic system 
behavior. 
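One common way of obtaining the PACF from the ACF is the Durbin-Levinson recursion, sketched below under the assumption that NumPy is available. The function name `pacf_from_acf` is hypothetical, and the recursion is offered as an illustration of the idea of "excess correlation at lag k after fitting a model of order (k-1)", not as the procedure used by the commercial software mentioned in the text.

```python
import numpy as np

def pacf_from_acf(r):
    """Partial autocorrelations phi_kk (k = 1..K) from r = [r_1, ..., r_K]
    via the Durbin-Levinson recursion."""
    r = np.asarray(r, dtype=float)
    K = r.size
    pacf = np.empty(K)
    phi_prev = np.array([])                 # AR coefficients of the order-(k-1) fit
    for k in range(1, K + 1):
        if k == 1:
            phi_kk = r[0]
        else:
            # numerator: r_k - sum_j phi_{k-1,j} * r_{k-j}
            num = r[k - 1] - np.dot(phi_prev, r[k - 2::-1])
            # denominator: 1 - sum_j phi_{k-1,j} * r_j
            den = 1.0 - np.dot(phi_prev, r[:k - 1])
            phi_kk = num / den
        phi_new = np.empty(k)
        phi_new[:k - 1] = phi_prev - phi_kk * phi_prev[::-1]
        phi_new[k - 1] = phi_kk
        phi_prev = phi_new
        pacf[k - 1] = phi_kk
    return pacf

# First-order process with r_1 = 0.8: the ACF decays as 0.8^k,
# while the PACF is 0.8 at lag 1 and (numerically) zero afterwards.
acf_ar1 = 0.8 ** np.arange(1, 6)
pacf_ar1 = pacf_from_acf(acf_ar1)
```

The example reproduces the contrast described above: a tapering ACF against a PACF that cuts off after lag 1.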

Example 9.5.1: The ACF applied to peak electric demand 
Consider the data used earlier in Example 9.1.1 and shown in Table 9.1. One would speculate that four lags are important because the data is taken quarterly (four times a year). How the ACF is able to detect this effect is illustrated below. 

The data is plotted in Fig. 9.1. A commercial software package has been used to generate the ACF shown in Fig. 9.18, while Table 9.4 shows the estimated autocorrelations between values of electric power at various lags (only the first 8 lags are shown) along with their standard errors and the 95% probability limits around 0.0. The lag k autocorrelation coefficient measures the correlation between values of electric power at time t and time (t-k). If the probability limits at a particular lag do not contain the estimated coefficient, there is a statistically significant correlation at that lag at the 95% confidence level. In this case, 4 of the 24 autocorrelation coefficients are statistically significant at the 95% confidence level (lags 1 to 4 in the table), implying that the time series is not completely random (white noise); this is consistent with the result expected. ■ 
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Fig. 9.18 ACF and PACF plots 
for the time series data given in 
Table 9. 1 along with their 95% 
uncertainty bands. Note the 
asymptotic behavior of the ACF 
and the abrupt cutoff of the PACF 
after a finite number of lags 
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Table 9.4 Estimated autocorrelation coefficients till lag 8 and associa- 
ted uncertainty 



Lag   Autocorrelation   Stnd. error   Lower 95.0% prob. limit   Upper 95.0% prob. limit
1     0.679357          0.144338      -0.282897                 0.282897
2     0.829781          0.200159      -0.392305                 0.392305
3     0.571588          0.262207      -0.513918                 0.513918
4     0.737873          0.286994      -0.562499                 0.562499
5     0.462624          0.324116      -0.635257                 0.635257
6     0.600009          0.337593      -0.661671                 0.661671
7     0.358554          0.359123      -0.70387                  0.70387
8     0.48272           0.366505      -0.718338                 0.718338



9.5.2.3 Detrending Data by Differencing 

Detrending is the process by which the deterministic trend 
in the data can be removed or filtered out. One could use a 
regression model to achieve this, but a simpler method, and 
one which yields insight into the order of the time series mo- 
del, is differencing. For data series that do not have cyclic 
variation (i.e. non-seasonal data), differencing can make a 
non-stationary time series stationary. A backward first-order 
difference that can remove a linear trend is: 



$$\nabla Y_{t+1} = Y_{t+1} - Y_t \tag{9.25a}$$

where $\nabla$ is the backward difference operator. 



Similarly, the second order differencing to remove a quadratic trend is: 

$$\nabla^2 Y_{t+1} = (Y_{t+1} - Y_t) - (Y_t - Y_{t-1}) = Y_{t+1} - 2Y_t + Y_{t-1} \tag{9.25b}$$



Thus, differencing a time series is akin to finite differencing 
a derivative. The time series data in Fig. 9.19a is quadratic, 
with the first differencing making it linear (see Fig. 9.19b) 
and the second differencing (Fig. 9.19c) making it constant, 
i.e. totally without any trend. This is, of course, a simple 
example, and actual data will not detrend so cleanly. 

Usually, not more than second order differencing is needed to make a time series stationary provided no seasonal trend is present. In case this is not so, a log transform should be investigated. Let us illustrate the above concepts using very simple examples (McCain and McCleary 1979). Consider the series {1, 2, 3, 4, 5, ..., N}. Differencing this sequence results in {1, 1, 1, 1, 1, ..., 1} which is stationary. Hence, first order differencing is adequate. Now consider the sequence $\{2, 4, 8, 16, 32, \ldots, 2^N\}$. No matter how many times this sequence is differenced, it will remain nonstationary. Let us log-transform the sequence as follows: {1(ln 2), 2(ln 2), 3(ln 2), 4(ln 2), ..., N(ln 2)}. If this sequence is differenced just once, one gets a stationary sequence. Series which require a log-transform to make them stationary are called "explosively" nonstationary. They are not as common in practice as linear time series data, but when they do appear, they are easy to detect. 
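The differencing examples above can be checked with a few lines of code. The sketch below assumes NumPy and uses its `numpy.diff` function; the variable names and the series length of 10 points are illustrative choices only.

```python
import numpy as np

t = np.arange(1, 11)

# Linear series {1, 2, 3, ...}: one difference gives a constant series of ones
linear = t.astype(float)
d1_linear = np.diff(linear)

# Quadratic series 10 t^2: two differences give a constant series (value 20)
quadratic = 10.0 * t**2
d2_quadratic = np.diff(quadratic, n=2)

# "Explosive" series {2, 4, 8, ...}: differencing reproduces the same growth ...
explosive = 2.0 ** t
d1_explosive = np.diff(explosive)          # {2, 4, 8, ...} again

# ... but one difference of the log-transformed series is constant (ln 2)
d1_log = np.diff(np.log(explosive))
```

The first difference of the explosive series equals the series itself (shifted by one term), which is why repeated differencing never renders it stationary, whereas the log transform followed by a single difference does.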
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Fig. 9.19 a-c Simple illustration of how successive differencing can reduce a non-stationary time series to a stationary one (function assumed: 
$x = 10t^2$) 
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For time series data that have periodic (or seasonal) varia- 
bility, seasonal differencing has to be done, albeit appropria- 
tely. For example, time series data of hourly electricity use in 
a building exhibits strong diurnal variability which is, howe- 
ver, fairly repetitive from one day to the next. Differencing 
hourly data 24 h apart would be an obvious way of making 
the series close to stationary. Thus, one employs the operator $\nabla_{24}$ and obtains $\nabla_{24} Y_t = Y_t - Y_{t-24}$, which is likely to detrend the data series. Though visual inspection is one way of determining whether a data series has become stationary or not, a better way is to use statistical tools such as the correlogram. 
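Seasonal differencing at a 24 h lag can be sketched as below. The hourly series is synthetic (a repeating sinusoidal daily profile plus Normal(0,1) noise) and is an assumption made purely for illustration; it is not building data from the text. NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)                 # fixed seed, illustrative only
hours = np.arange(24 * 14)                     # two weeks of hourly values
daily_profile = 50.0 + 20.0 * np.sin(2.0 * np.pi * hours / 24.0)
y = daily_profile + rng.normal(0.0, 1.0, hours.size)

lag = 24
y_seasonal_diff = y[lag:] - y[:-lag]           # Y_t - Y_{t-24}

spread_before = np.std(y)                      # dominated by the diurnal swing
spread_after = np.std(y_seasonal_diff)         # only differenced noise remains
```

Because the daily profile repeats exactly in this synthetic case, the differenced series contains only noise; with real data the repetition is approximate and the correlogram of the differenced series should still be inspected.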

Another question is whether detrending a time series by 
differencing it changes any of the deterministic parameters 
that describe the processes. The answer is that differencing 
does not affect the parameters but only affects the manner 
in which they are represented in the model. Consider the sequence {2, 4, 6, 8, ..., 2N}. Clearly this has a linear secular trend which can be modeled by the equation $Y_t = 2t$, i.e., a linear model with slope 2. However, one can also represent the same series by the equation: 

$$Y_t = Y_{t-1} + 2$$

Thus, the explicit trend model and the first-order linear difference equation both retain the parameter as is, but the parameter appears in a different form in each equation. 



9.5.3 ARIMA Models 

The ARIMA (Auto Regressive Integrated Moving Average) 
model formulation is a general linear framework which ac- 
tually consists of three sub-models: the autoregressive (AR), 
the integrated (I), and the moving average (MA). It is a mo- 
del linear in its parameters which simplifies the estimation. 
The integrated component is meant to render the time se- 
ries data stationary, while the MA and AR components are 
meant to address the stochastic element. Often, time series 
data have mean values that are dependent upon time (drift 
upwards or downwards), have non-constant variance in the 
random shocks that drive the process, or they possess a sea- 
sonal trend. A seasonal trend is reflective of autocorrelation 
at a large lag, and if the cyclic patterns are known, they can 
be detrended as described earlier. The non-constant variance 
violation is often handled by transforming the data in some 
fashion (refer to Chatfield 1989). ARIMA models are expressed as ARIMA(p, d, q) where p is the order of the AR component, q is the order of the MA component and d is the number of times the series is differenced to make the time series data stationary (see footnote 4). Thus, ARIMA(p, 0, q) implies that the time series is already stationary and that no differencing need be done. Similarly, ARIMA(p, 2, q) denotes that differencing needed to be done twice to make the time series data stationary. 

Consider a system which could be an engineering system with well-defined behavior, or a complex system (such as the stock market). The current response of a dynamic system depends on its past values. Think of the cooling of a hot water tank where the temperature at time t is necessarily a function of the temperature at time (t-1) as well as of the "shocks" or random perturbations to which it is subjected. The AR model captures the former element, i.e., the "memory" or past behavior of the system, by expressing the series residuals at the current time as a linear function of p past residuals. The MA model captures the random shocks or perturbations on the system (which do not persist) via a linear function of q past white noise errors. The AR component is dominant in systems which are fairly deterministic and with a direct relationship between adjacent observations (such as the tank cool-down example). The order p is directly related to the order of the differential equation of the white-box model which will adequately model the system behavior (see Sect. 9.6.3). The MA component is a special type of low-pass filter which is a generalization of the EWA smoothing approach described in Sect. 9.3.2. 

The MA and AR models are obviously special cases of the ARMA formulation. Unlike OLS type models, ARMA models require relatively long data series for parameter estimation (a minimum of about 50 data points and preferably 100 data points or more) and are based on a linear model formulation. However, they can provide very accurate forecasts and offer both a formal and a structured approach to model building and analysis. Several texts suggest that, in most cases, it is not necessary to include both the AR and the MA elements; one of these two, depending on system behavior, should suffice. Further, it is recommended that analysts new to this field limit themselves to low order models of 3 or less provided the seasonality has been properly filtered out. 

9.5.3.1 ARMA Models 

Let us consider a discrete random time series which is stationary but serially dependent, and is represented by the series $\{Z_t\}$. Let $\{a_t\}$ be a white noise or "purely" random series, also referred to as random shocks or innovations. The ARIMA(p, 0, q) model is then written as: 

$$\phi'_0 Z_t = \phi'_1 Z_{t-1} + \phi'_2 Z_{t-2} + \cdots + \phi'_p Z_{t-p} + \omega'_0 a_t + \omega'_1 a_{t-1} + \cdots + \omega'_q a_{t-q} \tag{9.26a}$$

where $\{\phi'_i\}$ and $\{\omega'_i\}$ are the weights on the $\{Z_t\}$ and $\{a_t\}$ series respectively. The weights are usually scaled by setting $\phi'_0 = 1$ and $\omega'_0 = 1$, so that the general ARIMA(p, 0, q) formulation is: 

$$Z_t = \phi_1 Z_{t-1} + \phi_2 Z_{t-2} + \cdots + \phi_p Z_{t-p} + a_t + \omega_1 a_{t-1} + \omega_2 a_{t-2} + \cdots + \omega_q a_{t-q} \tag{9.26b}$$

Footnote 4: Stationarity of a stochastic process can be interpreted qualitatively as a process which is in statistical equilibrium. 






9.5.3.2 MA Models 

The first order moving average model or MA(1) represents a linear system subjected to a shock in the first time interval only and which does not persist over subsequent time periods. Following Box and Luceno (1997), it is written as: 

$$Z_t = a_t + \omega_1 a_{t-1} = a_t - \theta_1 a_{t-1} \tag{9.27}$$

where the coefficient $\theta_1 = -\omega_1$ is introduced by convention and represents the weighted portion of the previous shock at time (t-1). Thus: 

$$Z_{t-1} = a_{t-1} - \theta_1 a_{t-2}, \quad Z_{t-2} = a_{t-2} - \theta_1 a_{t-3}$$

and so on. 

The general expression for an MA model of order q, i.e., the MA(q) model, is: 

$$Z_t = a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \cdots - \theta_q a_{t-q} \tag{9.28}$$



The white noise terms $a_t$ are often modeled as a normal distribution with zero mean and standard deviation $\sigma$, or $N(0, \sigma)$ and, hence, the process given by Eq. 9.28 will fluctuate around zero. If a process with a non-zero mean $\mu$ but without any trend or seasonality is to be modeled, a constant term c is introduced in the model such that $c = \mu$. An example of an MA(1) process with mean 10 and $\theta_1 = -0.9$ is depicted in Fig. 9.20a where a set of 100 data points have been generated in a spreadsheet program using the model shown with a random number generator N(0,1) for the white noise term. Since this is a first order model, the ACF should have only one significant value (this is seen in Fig. 9.20b where the ACF values for greater lags fall inside the 95% confidence intervals). Ideally, there should only be one spike at lag k = 1, but because random noise was introduced in the synthetic data, this obfuscates the estimation, and spikes at other lags appear which, however, are statistically insignificant. 

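For MA models, the ACF can be deduced from the model coefficients (Montgomery and Johnson 1976): 

$$\text{For MA(1):}\quad r_k = \frac{-\theta_1}{1 + \theta_1^2} \ \text{for } k = 1 \quad\text{and}\quad r_k = 0 \ \text{for } k > 1 \tag{9.29}$$

and 

$$\text{for MA(2):}\quad r_1 = \frac{-\theta_1 (1 - \theta_2)}{1 + \theta_1^2 + \theta_2^2}, \quad r_2 = \frac{-\theta_2}{1 + \theta_1^2 + \theta_2^2} \quad\text{and}\quad r_k = 0 \ \text{for } k > 2$$

For the above MA(1) example, $r_1 = 0.9/(1 + 0.9^2) = 0.497$, which is what is indicated in Fig. 9.20b. 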
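The theoretical MA autocorrelations can be compared against a simulated realization. The sketch below assumes NumPy; the function names are hypothetical, and the series length of 20,000 points (rather than the 100 points used for Fig. 9.20) is a deliberate assumption chosen so that sampling noise is small.

```python
import numpy as np

def ma1_acf_lag1(theta1):
    """Theoretical r_1 of Z_t = a_t - theta1 * a_{t-1} (Eq. 9.29)."""
    return -theta1 / (1.0 + theta1**2)

def ma2_acf(theta1, theta2):
    """Theoretical (r_1, r_2) of an MA(2) process; r_k = 0 for k > 2."""
    den = 1.0 + theta1**2 + theta2**2
    return -theta1 * (1.0 - theta2) / den, -theta2 / den

# Simulate the MA(1) process with mean 10 and theta_1 = -0.9,
# i.e. Z_t = 10 + a_t + 0.9 a_{t-1}, with a_t ~ Normal(0, 1)
theta1, mean = -0.9, 10.0
rng = np.random.default_rng(1)               # fixed seed, illustrative only
a = rng.normal(0.0, 1.0, 20001)
z = mean + a[1:] - theta1 * a[:-1]

d = z - z.mean()
r1_sample = np.sum(d[:-1] * d[1:]) / np.sum(d * d)   # should be near 0.497
r2_sample = np.sum(d[:-2] * d[2:]) / np.sum(d * d)   # should be near zero
```

With a long series the sample ACF shows a single spike at lag 1 close to the theoretical value, while with only 100 points the spurious but insignificant spikes mentioned above become visible.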

The PACF function alternates with lag term but damps out 
exponentially (Fig. 9.20c). MA processes are not very com- 
mon in engineering, but they are often used in areas where 
the origin of the shocks is unexpected. For example, in eco- 
nometrics, events such as strikes and government decisions 
are modeled as white noise. 

9.5.3.3 AR Models 

Autoregressive models of order p, or AR(p) models, are often adopted in engineering applications. The first-order AR(1) model is written as (Box and Luceno 1997): 

$$Z_t = \phi_1 Z_{t-1} + a_t \tag{9.30}$$

where $\phi_1$ is the ACF at lag 1 such that $-1 < \phi_1 < 1$. The AR(1) model, also called a Markov process, is often used to model physical processes. A special case is when $\phi_1 = 1$, which represents another well known process called the random walk model. If a process with a non-zero mean $\mu$ is to be modeled, a constant term c is introduced in the model such that $c = \mu(1 - \phi_1)$. 

At time (t-1): $Z_{t-1} = \phi_1 Z_{t-2} + a_{t-1}$ which, when substituted back into Eq. 9.30, yields 

$$Z_t = a_t + \phi_1 a_{t-1} + \phi_1^2 Z_{t-2} \tag{9.31}$$

Eventually, by successive substitution, $Z_t$ can be expressed as an infinite-order MA. From the viewpoint of a compact model, an AR model approach is, thus, superior to the MA modeling approach. 

An AR(2) process is written as: 

$$Z_t = \phi_1 Z_{t-1} + \phi_2 Z_{t-2} + a_t \tag{9.32}$$

with the conditions that: 

$$-1 < \phi_2 < 1, \quad (\phi_1 + \phi_2) < 1, \quad (\phi_2 - \phi_1) < 1$$

Fig. 9.20 a-c One realization of an MA(1) process for $Z_t = 10 + \varepsilon_t + 0.9\varepsilon_{t-1}$ along with corresponding ACF and PACF with error term being Normal(0,1) 

Again, if a process with a non-zero mean $\mu$ is to be modeled, a constant term c is introduced in the model such that $c = \mu(1 - \phi_1 - \phi_2)$. 

By extension, an AR(p) model will assume the form: 

$$Z_t = \phi_1 Z_{t-1} + \phi_2 Z_{t-2} + \cdots + \phi_p Z_{t-p} + a_t \tag{9.33}$$

For AR(1): the ACF is $r_k = \phi_1^k$ (exponential decay) and the PACF is $\phi_{11} = r_1$ (9.34) 

and for AR(2): the ACF is $r_0 = 1$, $r_1 = \phi_1/(1 - \phi_2)$, and $r_k = \phi_1 r_{k-1} + \phi_2 r_{k-2}$ for k > 1, while the PACF is $\phi_{11} = r_1$, and $\phi_{22}$ requires an iterative solution. 

There are no simple formulae to derive the PACF for orders higher than 2, and hence software programs involving iterative equations, known as the Yule-Walker equations, are used to estimate the parameters of the AR model (see, for example, Box and Luceno 1997). 
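The Yule-Walker equations relate the AR coefficients to the autocorrelations through a linear system $R\,\phi = r$, where R holds $r_{|i-j|}$ with $r_0 = 1$. The sketch below solves this system directly with NumPy instead of iteratively; that is a simplification chosen for brevity, not the scheme of any particular software package, and the function name `yule_walker_from_acf` is hypothetical.

```python
import numpy as np

def yule_walker_from_acf(r):
    """AR(p) coefficients from autocorrelations r = [r_1, ..., r_p]
    by solving the Yule-Walker system R * phi = r."""
    r = np.asarray(r, dtype=float)
    p = r.size
    full = np.concatenate(([1.0], r))                      # r_0, r_1, ..., r_p
    idx = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    R = full[idx]                                          # R[i, j] = r_|i-j|
    return np.linalg.solve(R, r)

# Theoretical ACF of an AR(2) process with phi_1 = 0.8 and phi_2 = -0.8
phi1, phi2 = 0.8, -0.8
r1 = phi1 / (1.0 - phi2)            # 0.444...
r2 = phi1 * r1 + phi2               # -0.444...
phi_hat = yule_walker_from_acf([r1, r2])   # recovers (0.8, -0.8)
```

In practice the sample autocorrelations replace the theoretical ones, so the recovered coefficients are estimates subject to sampling error.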

Two processes, one for AR(1) with a positive coefficient and the other for AR(2) with one positive and one negative coefficient, are shown in Figs. 9.21 and 9.22 along with their respective ACF and PACF plots. The ACF, even for a model of order one, dies down exponentially, and this is where the PACF is useful. Only one PACF term is statistically significant at the 95% confidence level for AR(1) in Fig. 9.21 while two terms are so in Fig. 9.22 (as it should be). The process mean line for AR(1) and the constant term appearing in the model are related: $\mu = c/(1 - \phi_1) = 5/(1 - 0.8) = 25$, which is consistent with the process behavior shown in Fig. 9.21a. 

For AR(2), the process mean line is $\mu = c/(1 - \phi_1 - \phi_2) = 25/(1 - 0.8 + 0.8) = 25$, also consistent with Fig. 9.22a. For the ACF: $r_1 = 0.8/(1 - (-0.8)) = 0.44$ and $r_2 = -0.8 + (0.8)^2/(1 - (-0.8)) = -0.44$. The latter value is slightly different from the value of about -0.5 shown in Fig. 9.22b, which is due to the white noise introduced in the synthetic sequence. 
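The relation between the constant term and the process mean can be checked by simulation. The sketch below assumes NumPy; the function name `simulate_ar`, the burn-in length and the series length of 5000 points are illustrative assumptions (the figures in the text use much shorter realizations).

```python
import numpy as np

def simulate_ar(c, phi, n, rng, burn_in=500):
    """Generate n values of Z_t = c + sum_j phi[j] * Z_{t-1-j} + e_t, e_t ~ N(0, 1)."""
    phi = np.asarray(phi, dtype=float)
    p = phi.size
    mu = c / (1.0 - phi.sum())                   # process mean, e.g. c / (1 - phi_1)
    z = np.full(n + burn_in + p, mu)             # start the recursion at the mean
    e = rng.normal(0.0, 1.0, z.size)
    for t in range(p, z.size):
        # z[t-p:t][::-1] is (Z_{t-1}, Z_{t-2}, ..., Z_{t-p})
        z[t] = c + np.dot(phi, z[t - p:t][::-1]) + e[t]
    return z[-n:]                                # discard the burn-in portion

rng = np.random.default_rng(2)                   # fixed seed, illustrative only
z_ar1 = simulate_ar(5.0, [0.8], 5000, rng)           # Z_t = 5 + 0.8 Z_{t-1} + e_t
z_ar2 = simulate_ar(25.0, [0.8, -0.8], 5000, rng)    # Z_t = 25 + 0.8 Z_{t-1} - 0.8 Z_{t-2} + e_t

mean_ar1 = z_ar1.mean()                          # expected to be near 25
mean_ar2 = z_ar2.mean()                          # expected to be near 25
```

Both realizations fluctuate around 25 even though their constant terms differ (5 and 25), which illustrates why c should not be read as the process mean.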

Finally, Fig. 9.23 illustrates an ARMA (1,1) process whe- 
re elements of both MA(1) and AR(1) processes are present: 
exponential damping of both the ACF and the PACF. 

These four sets of figures (Figs. 9.20-9.23) partially illustrate the fact that one can model a stochastic process using different models, a dilemma which one faces even when identifying classical OLS models. Hence, evaluation of competing models using a cross-validation sample is highly advisable, as is investigating whether there is any correlation structure left in the residuals of the series after the stochastic model effect has been removed. These tests closely parallel those which one would perform during OLS regression. 

9.5.3.4 Identification and Forecasting 

One could use the entire ARIMA model structure as descri- 
bed above to identify a complete model. However, with the 



Fig. 9.21 a-c One realization of an AR(1) process for $Z_t = 5 + 0.8 Z_{t-1} + \varepsilon_t$ along with corresponding ACF and PACF with error term being Normal(0,1) 

Fig. 9.22 a-c One realization of an AR(2) process for $Z_t = 25 + 0.8 Z_{t-1} - 0.8 Z_{t-2} + \varepsilon_t$ along with corresponding ACF and PACF with error term being Normal(0,1) 

Fig. 9.23 One realization of an ARMA(1,1) process for $Z_t = 15 + 0.8 Z_{t-1} + \varepsilon_t + 0.9 \varepsilon_{t-1}$ along with corresponding ACF and PACF with error term being Normal(0,1) 
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intention of simplifying the process of model identification, 
and in recognition of the importance of AR models, let us 
limit the scope to AR models only. The procedure for identifying an AR(p) model with n data points, and then using it for forecasting purposes, is summarized below. 

To Identify an AR(p) Model: 

(i) Evaluate different trend and seasonal models using OLS, and identify the best one based on internal and external predictive accuracies: 

$$\hat{Y}_t = b_0 + b_1 x_{1,t} + \cdots + b_k x_{k,t} \tag{9.35a}$$

where the x terms are regressors which account for the trend and seasonal variation. 

(ii) Calculate the residual series as the difference between observed and predicted values: $\{Z_t\} = \{Y_t - \hat{Y}_t\}$ 

(iii) Determine the ACF of the residual series for different lags: $r_1, r_2, \ldots, r_k$ 

(iv) Determine the PACF of the residual series for different lags: $\phi_{11}, \phi_{22}, \ldots, \phi_{kk}$ 

(v) Generate correlograms for the ACF and PACF, and make sure that the series is stationary 

(vi) Evaluate different AR models based on their internal and external predictive accuracies, and select the most parsimonious AR model (often, 1 or 2 terms should suffice); see Sect. 9.5.4 for general recommendations: 

$$\hat{Y}_t = (b_0 + b_1 x_{1,t} + \cdots + b_k x_{k,t}) + (\phi_1 Z_{t-1} + \phi_2 Z_{t-2} + \cdots + \phi_p Z_{t-p}) \tag{9.35b}$$

(vii) Calculate the RMSE of the overall model (trend plus seasonal plus stochastic) thus identified. 

To Forecast a Future Value $Y_{t+1}$ When Updating Is Possible (i.e., the Value $Y_t$ Is Known): 

(i) Compute the series: $Z_t, Z_{t-1}, \ldots, Z_{t-p}$ 

(ii) Estimate $\hat{Z}_{t+1} = \phi_1 Z_t + \phi_2 Z_{t-1} + \cdots + \phi_p Z_{t-p+1}$ 

(iii) Finally, use the overall model (Eq. 9.35b) modified to time step (t+1) as follows: 

$$\hat{Y}_{t+1} = (b_0 + b_1 x_{1,t+1} + \cdots + b_k x_{k,t+1}) + \hat{Z}_{t+1} \tag{9.35c}$$

(iv) Calculate approximate 95% prediction limits for the forecast given by: 

$$\hat{Y}_{t+1} \pm 2 \cdot \text{RMSE} \tag{9.36}$$

(v) Re-initialize the series by setting $Z_t$ as the residual of the most recent period, and repeat steps (i) to (iv). 



To Forecast a Future Value When Updating Is Not Possible: In case one lacks observed values for future forecasts (such as having to make forecasts over a horizon involving several time steps ahead), these are to be estimated in a recursive manner as follows. The first forecast is made as before, but now one is unable to compute the model error which is to be used for predicting the second forecast, and the subsequent accumulation of errors widens the confidence intervals as one predicts further into the future. Then: 
(i) Future forecast $\hat{Y}_{t+2}$ for the case when no updating is possible (i.e., $Y_{t+1}$ is not known and so one cannot determine $Z_{t+1}$): 

$$\hat{Z}_{t+2} = \phi_1 \hat{Z}_{t+1} + \phi_2 Z_t + \cdots + \phi_p Z_{t-p+2}$$

$$\hat{Y}_{t+2} = (b_0 + b_1 x_{1,t+2} + \cdots + b_k x_{k,t+2}) + \hat{Z}_{t+2} \tag{9.37}$$

and so forth... 
(ii) An approximate 95% prediction interval for m time steps should be determined, and this is provided by the software program used. For the simple case of AR(1), for forecasts m time steps ahead: 

$$\hat{Y}_{t+m} \pm 2 \cdot \text{RMSE} \cdot \left[1 + \phi_1^2 + \cdots + \phi_1^{2(m-1)}\right]^{1/2} \tag{9.38}$$



Note that the ARMA models are usually written as equations with fixed estimated parameters representing a stochastic structure that does not change with time. Hence, such models are not adaptive. This is the reason why some researchers caution against the use of these models for forecasting several time steps ahead when updating is not possible. 
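The recursive AR(1) forecast without updating and the widening limits of Eq. 9.38 can be sketched as below. NumPy is assumed; the function name `ar1_forecast_limits` is hypothetical, and the numerical inputs (last residual -13.95, $\phi_1$ = 0.657, RMSE = 4.98) are borrowed from the peak electric demand example of this section purely for illustration. Only the stochastic (residual) part is computed; the trend and seasonal part of the forecast would be added separately.

```python
import numpy as np

def ar1_forecast_limits(z_last, phi1, rmse, m_max):
    """Residual forecasts and approximate 95% half-widths for horizons
    m = 1..m_max of an AR(1) model when no updating is possible."""
    m = np.arange(1, m_max + 1)
    # Recursive substitution gives Z_hat_{t+m} = phi1^m * Z_t
    z_hat = z_last * phi1 ** m
    # [1 + phi1^2 + ... + phi1^(2(m-1))]^(1/2) from Eq. 9.38
    spread = np.sqrt(np.cumsum(phi1 ** (2 * (m - 1))))
    return z_hat, 2.0 * rmse * spread

z_hat, half_width = ar1_forecast_limits(z_last=-13.95, phi1=0.657, rmse=4.98, m_max=4)
```

The residual forecast decays geometrically towards zero while the half-width grows with the horizon, which is the accumulation of uncertainty described above.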

Example 9.5.2: AR model for peak electric demand 
Consider the same data shown in Table 9.1 for the electric utility which consists of four quarterly observations per year for 12 years (from 1974 to 1985). Let us illustrate the use of the AR(1) model with this data set and highlight its importance as compared to the various models described earlier. 

The trend and seasonal model is given in Example 9.4.2. This model is used to calculate the residuals $\{Z_t\}$ for each of the 48 data points. The ACF and PACF for $\{Z_t\}$ are shown in Fig. 9.24. Since the PACF cuts off abruptly after lag 1, it is concluded that an AR(1) model is adequate to model the stochastic residual series. The corresponding model was identified by OLS: 

$$Z_t = 0.657 \cdot Z_{t-1} + a_t$$

Note that the value $\phi_1 = 0.657$ is consistent with the value shown in the PACF plot of Fig. 9.24. Table 9.5 assembles the RMSE values of various models used in previous examples as well as for the AR(1) model. The internal RMSE prediction error for the AR(1) model has decreased to 4.98, which is a great improvement compared to 7.86 for the (linear + seasonal) regression model assumed earlier. ■ 
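Identifying an AR(1) model by OLS amounts to regressing $Z_t$ on $Z_{t-1}$ without an intercept. The sketch below assumes NumPy; since the Table 9.1 data are not reproduced here, it uses a synthetic residual series generated with $\phi_1$ = 0.657 and an arbitrary noise standard deviation of 5, both of which are assumptions for illustration only. The function name `fit_ar1_ols` is hypothetical.

```python
import numpy as np

def fit_ar1_ols(z):
    """Least-squares estimate of phi_1 in Z_t = phi_1 * Z_{t-1} + a_t (no intercept)
    and the RMSE of the one-step-ahead residuals."""
    z = np.asarray(z, dtype=float)
    x, y = z[:-1], z[1:]                      # regressor Z_{t-1}, response Z_t
    phi1 = np.dot(x, y) / np.dot(x, x)
    resid = y - phi1 * x
    rmse = np.sqrt(np.mean(resid ** 2))
    return phi1, rmse

# Synthetic stand-in for the residual series {Z_t}; not the Table 9.1 data
rng = np.random.default_rng(3)                # fixed seed, illustrative only
z = np.zeros(2000)
for t in range(1, z.size):
    z[t] = 0.657 * z[t - 1] + rng.normal(0.0, 5.0)

phi1_hat, rmse_hat = fit_ar1_ols(z)
```

With a long synthetic series the estimate lands close to the generating coefficient; with only 48 points, as in the example, a noticeably larger sampling scatter should be expected.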
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Fig. 9.24 The ACF and PACF 

functions for the residuals in the 
time series data after removing 
the linear trend and seasonal 
behavior 
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Table 9.5 Accuracy of the various modeling approaches when applied to the electric utility load data given in Table 9. 1 (48 data points covering 
1974-1985). The RMSE correspond to internal prediction accuracy of the various models evaluated 

         AMA(3)   AMA(5)   EWA(0.2)   EWA(0.5)   Linear   Linear + Seasonal   Linear + Seasonal + AR(1)
RMSE     7.68     9.02     8.59       11.53      12.10    7.86                4.98



Table 9.6 Forecast accuracy of the various models applied to the electric load data. The actual values correspond to the recorded values for the four quarters for 1986 

                    Actual Load (1986)   AMA(3)   AMA(5)   EWA(0.2)   EWA(0.5)   Linear   Linear + Seasonal   Linear + Seasonal + AR(1)
Quarter 1           151.3                141.1    142.9    150.6      144.8      157.5    164.3               155.2
Quarter 2           132.9                143.6    143.6    138.2      142.9      159.1    148.6               140.0
Quarter 3           160.5                140.0    141.9    148.1      144.4      160.7    172.4               162.1
Quarter 4           161.0                141.5    143.6    140.2      143.2      162.4    155.6               147.8
Average             151.4                141.5    143.0    144.3      143.8      159.9    160.2               151.3
%Diff. in average   -                    -6.6     -5.6     -4.7       -5.0       5.6      5.8                 0.1
RMSE                -                    15.96    14.44    12.40      13.40      13.48    12.11               7.78



Example 9.5.3: Comparison of the external prediction error of different models for peak electric demand 
The various models illustrated in previous sections have been compared in terms of their internal prediction errors. The more appropriate manner of comparing them is in terms of their external prediction errors such as bias and RMSE. The peak loads for the four quarters of 1986 will be used as the basis of comparison. 

The AR model can also be used to forecast the future values for the four quarters of 1986 ($Y_{49}$ to $Y_{52}$). First, one determines the residual for the last quarter of 1985 ($Z_{48}$) using the trend and seasonal model forecast $\hat{Y}_{LS,48}$. The AR(1) correction is subsequently determined, and finally, the forecast for the first quarter of 1986, or $\hat{Y}_{49}$, is computed from: 

$$Z_{48} = Y_{48} - \hat{Y}_{LS,48} = 135.1 - 149.05 = -13.95$$

$$\hat{Z}_{49} = \phi_1 \cdot Z_{48} = (0.657)(-13.95) = -9.16$$

$$\hat{Y}_{49} = \hat{Y}_{LS,49} + \hat{Z}_{49} = 164.34 - 9.16 = 155.2$$

Finally, the approximate 95% prediction limits are: $\hat{Y}_{t+1} \pm 2 \cdot \text{RMSE} = 155.2 \pm 2(4.98) = (145.2, 165.2)$ 

The individual and mean forecasts for all methods are shown in Table 9.6. One can compare the various models in how accurately they are able to predict these four individual values. The results indicate that the mean differences in forecasts are similar in magnitude across most models, about 5-6%, except for AMA(3) which is closer to 7%. Note that the forecast errors for AMA and EWA are negative; this is because the inherent lags in these smoothing techniques result in forecasts being lower than actual. EWA(0.2) is the most accurate among the smoothing models. The (linear + seasonal) model turns out to be quite poor with an average bias of 5.8%. On the other hand, the predictions are almost perfect for the AR model (the bias is only 0.1%) while the external prediction RMSE is also the lowest at 7.78. This is followed by the (linear + seasonal) model with RMSE = 12.11. The other models have higher RMSE values. This example clearly illustrates the added benefit brought in by the AR(1) term. ■ 
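The external error measures quoted in this example can be recomputed from the forecasts listed in Table 9.6. The sketch below assumes NumPy and shows two of the models; the function names `external_rmse` and `pct_diff_in_average` are hypothetical, and the definitions used (root mean square of the four quarterly errors, and percentage difference between mean forecast and mean actual load) are assumptions that happen to reproduce the tabulated values.

```python
import numpy as np

actual = np.array([151.3, 132.9, 160.5, 161.0])      # recorded 1986 quarterly loads
forecasts = {
    "Linear + Seasonal":         np.array([164.3, 148.6, 172.4, 155.6]),
    "Linear + Seasonal + AR(1)": np.array([155.2, 140.0, 162.1, 147.8]),
}

def external_rmse(pred, obs):
    """Root mean square error over the hold-out quarters."""
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def pct_diff_in_average(pred, obs):
    """Percentage difference between mean forecast and mean observation."""
    return float(100.0 * (pred.mean() - obs.mean()) / obs.mean())

scores = {name: (external_rmse(p, actual), pct_diff_in_average(p, actual))
          for name, p in forecasts.items()}
```

The same two functions can be applied to the remaining columns of Table 9.6 to complete the comparison.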



9.5.4 Recommendations on Model 
Identification 

As stated earlier, several researchers suggest that, in most 
cases, it is not necessary to include both the AR and the MA 
elements; one of these two, depending on system behavior, 
should suffice. Further, it is recommended that low order 
models of 3 or less should be adequate in most instances 
provided the seasonality has been properly filtered out. Other
texts state that adopting an ARMA model is likely to result 
in a model with fewer terms than those of a pure MA or AR 
process by itself (Chatfield 1989). Some recommendations 
on how to identify and evaluate a stochastic model are sum- 
marized below. 

(a) Stationarity check: Whether the series is stationary or not
in its mean trend is easily identified (one must also
verify that the variance is stable, for which transformations
such as taking logarithms may be necessary). A non-constant
trend in the underlying mean value of a process
will result in the ACF not dying out rapidly. If seasonal
behavior is present, the ACF will detect it as
well since it will exhibit a decay with cyclic behavior. The
seasonality effect needs to be removed by using an appropriate
regression model (for example, the traditional
OLS or even a Fourier series model) or by differencing.
Figure 9.18 illustrates how a seasonal trend shows up in
the ACF of the time series data of Table 9.1. The seasonal
nature of the time series is reflected in the damped
sinusoidal behavior of the ACF. Differencing is another
way of detrending the series which is especially useful
when the cyclic behavior is known (such as 24 h lag
differencing for electricity use in buildings). If differencing
more than twice does not remove seasonality, consider
a transformation of the time series data using natural
logarithms.

(b) Model selection: The correlograms of both the ACF and 
the PACF are the appropriate means for identifying the 
model type (whether ARIMA, ARMA, AR or MA) and 
the model order. The identification procedure can be 
summarized as follows (McCain and McCleary 1979): 

(i) For AR(1): ACF decays exponentially, PACF has a
spike at lag 1, and other spikes are not statistically
significant, i.e., are contained within the 95% confidence
intervals
(ii) For AR(2): ACF decays exponentially (indicative of
positive model coefficients) or with sinusoidal-exponential
decay (indicative of a positive and a negative
coefficient), and PACF has two statistically significant
spikes
(iii) For MA(1): ACF has one statistically significant spike
at lag 1 and PACF damps down exponentially
(iv) For MA(2): ACF has two statistically significant spikes
(one at lag 1 and one at lag 2), and PACF has an
exponential decay or a sinusoidal-exponential decay
(v) For ARMA(1,1): ACF and PACF have spikes at lag 1
with exponential decay.
Usually, it is better to start with the lowest values of p
and q for an ARMA(p, q) process. Subsequently, the model
order is increased until no systematic patterns are
evident in the residuals of the model. Most time series
data from engineering experiments or from physical systems
or processes should be adequately modeled by low
orders, i.e., about 1-3 terms. If higher orders are required,
the analyst should check the data for bias or unduly
large noise effects. Cross-validation using the sample
hold-out approach is strongly recommended for model
selection since this avoids over-fitting and better
reflects the predictive capability of the model. The model
selection is somewhat subjective as described above. In
an effort to circumvent this arbitrariness, objective criteria
have been proposed for model selection. Wei (1990)
describes several such criteria: the Akaike Information
Criterion (AIC), the Bayesian Information Criterion (BIC)
and the Criterion for Autoregressive Transfer function
(CAT), to name three of several indices.
(c) Model evaluation: After a tentative time series model has 
been identified and its parameters estimated, a diagno- 
stic check must be made to evaluate its adequacy. This 
check could consist of two steps as described below: 
(i) the autocorrelation function of the simulated series
(i.e., the time series generated by the model) and
that of the original series must be close;
(ii) the residuals from a satisfactory model should be
white noise. This would be reflected by the sample
autocorrelation function of the residuals being
close or equal to zero. Since it is assumed that the
random error terms in the actual process are normally
distributed and independent of each other
(i.e., white noise), the model residuals should also
behave similarly. This is tested by computing the
sample autocorrelation function for lag k of the
residuals. If the model is correctly specified, the
residual autocorrelations $r_k$ (up to about K = 15 or so)
are themselves uncorrelated, normally distributed
random variables with mean zero and variance (1/n),
where n is the number of observations in the time
series. Finally, the sum of the squared independent
normal random variables, denoted by the Q statistic,
is computed as:

$$Q = n \sum_{k=1}^{K} r_k^2 \qquad (9.39)$$

Q must be approximately distributed as chi-square $\chi^2$
with (K-p-q) degrees of freedom. Table A.5
provides the critical value to determine whether or
not to accept the hypothesis that the model is
acceptable.
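
The residual check of step (c)(ii) can be sketched as follows, using only the Python standard library. The helper names `sample_acf` and `q_statistic` are hypothetical, and the two test series (white noise, and an AR(1) series with coefficient 0.8 standing in for the residuals of an under-specified model) are synthetic illustrations, not data from the text.

```python
import random


def sample_acf(x, max_lag):
    """Sample autocorrelations r_1 ... r_max_lag of the series x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]


def q_statistic(residuals, max_lag):
    """Q = n * sum of squared residual autocorrelations (Eq. 9.39)."""
    r = sample_acf(residuals, max_lag)
    return len(residuals) * sum(rk * rk for rk in r)


random.seed(0)
white = [random.gauss(0.0, 1.0) for _ in range(200)]   # well-behaved residuals

# Residuals that still contain AR(1) structure (model under-specified)
correlated = [0.0]
for e in white[1:]:
    correlated.append(0.8 * correlated[-1] + e)

K = 15
q_white = q_statistic(white, K)
q_correlated = q_statistic(correlated, K)
print(f"Q (white noise)      = {q_white:.1f}")
print(f"Q (AR(1) left over)  = {q_correlated:.1f}")
# Compare each Q against the chi-square critical value with (K - p - q)
# degrees of freedom; for 15 degrees of freedom at the 5% level this is
# about 25.0.
```

A Q value well above the critical value suggests that systematic structure remains in the residuals and that the model order should be revisited.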
Despite their obvious appeal, a note of caution on ARMA 
models is warranted. Fitting reliable multi-variate time series 
models is difficult. For example, in case of non-experimen- 
tal data which is not controlled, there may be high correla- 
tion between and within series which may or may not be 
real (there may be mutual correlation with time). An appa- 
rent good fit to the data may not necessarily result in better 
forecasting accuracy than using a simpler univariate model. 
Though ARMA models usually provide very good fits to the
data series, often, a much simpler method may give results 
just as good. 

An implied assumption of ARMA models is that the data 
series is stationary and normally distributed. If the data se- 
ries is not, it is important to find a suitable transformation to 
make it normally distributed prior to OLS model fitting. Fur- 
ther, if the data series has non-random disturbance terms, the 
maximum likelihood estimation (MLE) method described 
in Sect. 10.4.3 is said to be statistically more efficient than 
OLS estimation. The reader can refer to Wei (1990), Box 
and Jenkins (1976), or Montgomery and Johnson (1976) for 
more details. The autocorrelation function and the spectral 
method are closely related; the latter can provide insight into 
the appropriate order of the ARMA model (Chatfield 1989). 



9.6 ARMAX or Transfer Function Models 



9.6.1 Conceptual Approach and Benefit 



The ARIMA models presented above involve detrending the
data (via the "Integrated" component) prior to modeling.
The systematic stochastic component is modeled by ARMA
models which are univariate by definition since they only
consist of lagged variables of the detrended series {$Z_t$}. An
alternate form of detrending is to use OLS models such as
described in Sect. 9.4 which can involve indicator variables
for seasonality as well as the time variable explicitly. One
can even have other "independent" variables X appear in the
OLS model if necessary; for example: y = f(X). However,
such models do not contain lagged variables in X, as shown
in Fig. 9.25a. That is why the ARMA models are said to be
basically univariate since they relate to the detrended series
{$Z_t$}. This series is taken to be the detrended response of
white noise plus a feedback loop whose effect is taken into
consideration via the variation of the lagged variables. Thus,
the stochastic time series data points are in equilibrium over
time and fluctuate about a mean value.

However, there are systems whose response cannot be
satisfactorily modeled using ARMA models alone since their
mean values vary greatly over time. The obvious case is that of
dynamic systems which have some sort of feedback in the
independent or regressor variables, and such effects
need to be recognized explicitly. Traditional ARMA models
would then be of limited use since the error term would
include some of the structural variation which one could
directly attribute to the variation of the regressor variables.
Thus, the model predictions could be biased, with uncertainty
bands so large as to make predictions very poor, and often
useless. In such cases, a model relating the dependent variable
with lagged values of itself plus current and lagged values
of the independent variables, plus the error term captured by
the time-series model, is likely to be superior to the ARMA
models alone (see Fig. 9.25b). Such a model formulation is
called a "Multivariate ARMA" (or MARMA) model, an ARMAX
model or a transfer function model. Such models have found
extensive applications in engineering, econometrics and other
disciplines as well, and are briefly described below.



9.6.2 Transfer Function Modeling of Linear 
Dynamic Systems 

Dynamic systems are modeled by differential equations. A
simple example taken from Kreider et al. (2009) will illustrate
how linear differential equations can be recast as ARMAX
models such that the order of the differential equation is
equal to the number of lag terms in a time series. Consider a
plane wall represented by a 1C1R thermal network as shown
in Fig. 1.5 (see Sect. 1.2.4 for an introductory discussion
about representing the transient heat conduction through a
plane wall by electrical network analogues). The internal
node is the indoor air temperature $T_i$ which is assumed to be
closely coupled to the thermal mass of the building or room.
This node is impacted by internal heat loads generated from
people and various equipment (Q) and also by heat conduction
from the outdoors at temperature $T_o$ through the outdoor
wall with an effective resistance R.

Fig. 9.25 Conceptual difference between the single-variate ARMA
approach and the multivariate ARMAX approach applied to dynamic
systems. a Traditional ARMA approach. b ARMAX approach

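The thermal performance of such a system is modeled by:

$$C \dot{T}_i = \frac{T_o - T_i}{R} + Q \qquad (9.40a)$$

where $\dot{T}_i$ is the time derivative of $T_i$.

Introducing the time constant $\tau = RC$, the above equation
can be re-written as:

$$\tau \dot{T}_i + T_i = T_o + RQ \qquad (9.40b)$$

For the simplifying case when both the driving terms $T_o$ and
Q are constant, one gets

$$\tau \dot{T}(t) + T(t) = 0 \quad \text{where} \quad T(t) = T_i(t) - T_o - RQ \qquad (9.41)$$

The solution is

$$T(t+1) = T(0) \exp\left(-\frac{t+1}{\tau}\right) = T(0) \exp\left(-\frac{t}{\tau}\right) \exp\left(-\frac{1}{\tau}\right) = T(t) \exp\left(-\frac{1}{\tau}\right)$$

which can be expressed as:

$$T(t) + a_1 T(t-1) = 0 \quad \text{where} \quad a_1 = -\exp\left(-\frac{1}{\tau}\right) \qquad (9.42)$$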

The first order ODE is, thus, recast as the traditional
single-variate AR(1) model with one lag term. In this case, there
is a clear interpretation of the coefficient $a_1$ in terms of the
time constant of the system. Example 7.10.2 illustrates the
use of such models in the context of operating a building so
as to minimize the cooling energy use and demand during the
peak period of the day.
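
The link between the time constant and the AR(1) coefficient can be checked numerically with a short Python sketch. The function names and the values of R and C below are hypothetical; `theta` stands for the temperature deviation T(t) of Eq. 9.41, and the time step `dt` is written explicitly so that the unit time step of Eq. 9.42 is simply the default.

```python
import math


def ar1_coefficient(R, C, dt=1.0):
    """AR(1) coefficient a_1 = -exp(-dt / tau) with tau = R * C (Eq. 9.42)."""
    return -math.exp(-dt / (R * C))


def simulate_decay(theta0, a1, steps):
    """Recursion theta(t) + a1 * theta(t-1) = 0 starting from theta0."""
    theta = [theta0]
    for _ in range(steps):
        theta.append(-a1 * theta[-1])
    return theta


R, C = 2.0, 5.0                      # hypothetical values, tau = 10 time steps
a1 = ar1_coefficient(R, C)
series = simulate_decay(8.0, a1, 24)

# After one time constant (10 steps) the deviation should have dropped
# to exp(-1) of its initial value, as the continuous solution predicts.
print(f"a1 = {a1:.4f}")
print(f"theta(10) = {series[10]:.4f}, 8*exp(-1) = {8.0 * math.exp(-1.0):.4f}")
```

Conversely, an AR(1) coefficient identified from measured data can be inverted as tau = -dt / ln(-a1) to estimate the time constant, provided the driving terms were essentially constant during the measurement period.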

Kreider et al. (2009) also give another example of a network
with two nodes (i.e., 2R2C network) where, for a similar
assumption of constant driving terms $T_o$ and Q, one obtains
a second order ODE which can be cast as a time series
model with two lag terms, with the time series coefficients
(or transfer function coefficients) still retaining a clear relation
with the resistances and the two time constants of the
system. For more complex models and for cases when the
driving terms are not constant, such clear interpretation of
the time series model coefficients in terms of resistances and
capacitances would not exist since the same time series model
can apply to different RC networks, and so uniqueness
is lost.



For the general case of the free response of a non-air-conditioned
room or building represented by indoor air temperature
$T_i$ and which is acted upon by two driving terms $T_o$ and
Q which are time variant, the general form of the transfer
function or ARMAX model of order n is:

$$T_{i,t} + a_1 T_{i,t-1} + a_2 T_{i,t-2} + \cdots + a_n T_{i,t-n}
= b_0 T_{o,t} + b_1 T_{o,t-1} + b_2 T_{o,t-2} + \cdots + b_n T_{o,t-n}
+ c_0 Q_t + c_1 Q_{t-1} + c_2 Q_{t-2} + \cdots + c_n Q_{t-n} \qquad (9.43)$$

The model identification process, when applied to the observed
time series of these three variables, would determine
how many coefficients or weighting factors to retain in
the final model for each of the variables. Once such a model
has been identified, it can be used for accurate forecasting
purposes. In some cases, physical considerations can impose
certain restrictions on the transfer function coefficients,
and it is urged that these be considered since it would result
in sounder models. For the above example, it can be shown
that at the limit of steady state operation when present and
lagged values of each of the three variables are constant, heat
loss would be expressed in terms of the overall heat loss
coefficient U times the cross-sectional area A perpendicular to
heat flow: $Q_{loss} = UA(T_i - T_o)$. This would require that the
following condition be met:

$$(1 + a_1 + a_2 + \cdots + a_n) = (b_0 + b_1 + b_2 + \cdots + b_n) \qquad (9.44)$$
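
As an illustration of the steady-state restriction, the sketch below builds a first-order instance of Eq. 9.43 from a 1R1C network and verifies that its coefficients satisfy Eq. 9.44. The discretization used here is an assumption made for the sketch (the driving terms are taken as constant over each time step so that Eq. 9.40b can be integrated exactly), and all function names and numerical values are hypothetical.

```python
import math


def first_order_armax_coeffs(R, C, dt=1.0):
    """Coefficients of Eq. 9.43 with n = 1 for a 1R1C network.

    Assumption: T_o and Q are held constant over each time step, so the
    ODE of Eq. 9.40b is integrated exactly over dt.
    """
    decay = math.exp(-dt / (R * C))
    a = [-decay]               # a_1
    b = [1.0 - decay]          # b_0
    c = [R * (1.0 - decay)]    # c_0
    return a, b, c


def armax_next(Ti_lags, To_vals, Q_vals, a, b, c):
    """Solve Eq. 9.43 for the current indoor temperature T_{i,t}.

    Ti_lags = [T_{i,t-1}, ...], To_vals = [T_{o,t}, ...], Q_vals = [Q_t, ...]
    """
    rhs = sum(bj * v for bj, v in zip(b, To_vals))
    rhs += sum(cj * v for cj, v in zip(c, Q_vals))
    return rhs - sum(aj * v for aj, v in zip(a, Ti_lags))


R, C = 0.01, 1000.0            # hypothetical values, tau = 10 time steps
To, Q = 30.0, 100.0            # hypothetical constant driving terms
a, b, c = first_order_armax_coeffs(R, C)

Ti = 20.0                      # arbitrary starting indoor temperature
for _ in range(500):
    Ti = armax_next([Ti], [To], [Q], a, b, c)

steady_state_gap = (1.0 + sum(a)) - sum(b)   # Eq. 9.44 requires this to be 0
print(f"(1 + sum a) - sum b = {steady_state_gap:.2e}")
print(f"steady-state Ti = {Ti:.3f} (expected To + R*Q = {To + R * Q:.3f})")
```

When coefficients are identified from data by regression, the same check can be applied afterwards, or the restriction can be imposed during estimation, to make sure the fitted model has a physically sensible steady-state behavior.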



The transfer function approach^ has been widely used in se- 
veral detailed building energy simulation software programs 
developed during the last 30 years to model unsteady state 
heat transfer and thermal mass storage effects such as wall 
and roof conduction, solar heat gains and internal heat gains. 
It is a widely understood and accepted modeling approach 
among building energy professionals (see for example, Krei- 
der et al. 2009). The following example illustrates the ap- 
proach. 

Example 9.6.1: Transfer function model to represent
unsteady state heat transfer through a wall
For the case when indoor air temperature $T_i$ is kept constant
by air-conditioning, Eq. 9.43 can be re-expressed as follows
consistent with industry practice:

$$Q_{cond,t} = -d_1 Q_{cond,t-1\Delta t} - d_2 Q_{cond,t-2\Delta t} - \cdots
+ b_0 T_{solair,t} + b_1 T_{solair,t-1\Delta t} + b_2 T_{solair,t-2\Delta t} + \cdots
- T_i \sum_{n \ge 0} c_n \qquad (9.45a)$$



^ Strictly, this formulation should be called discrete transfer function or 
z-transform since it uses discrete time intervals (of one hour). 
Table 9.7 Conduction transfer function coefficients for a 4" concrete
wall with 2" insulation

            n=0       n=1        n=2       n=3       n=4
b_n         0.00099   0.00836    0.00361   0.00007   0.00
d_n         1.00      -0.93970   0.04664   0.00      0.00
Σc_n        0.01303


or

$$Q_{cond,t} = -\sum_{n \ge 1} d_n Q_{cond,t-n\Delta t} + \sum_{n \ge 0} b_n T_{solair,t-n\Delta t} - T_i \sum_{n \ge 0} c_n \qquad (9.45b)$$

where

$Q_{cond}$ is the conduction heat gain through the wall

$T_{solair}$ is the sol-air temperature (a variable which includes the
combined effect of outdoor dry-bulb air temperature
and the solar radiation incident on the wall)

$T_i$ is the indoor air temperature

$\Delta t$ is the time step (usually 1 h)

and $b_n$, $c_n$ and $d_n$ are the transfer function coefficients

Table 9.7 assembles values of the transfer function coefficients
for a 4" concrete wall with 2" insulation. For a given
hour, say 10:00 am, Eq. 9.45 can be expressed as:

$$Q_{cond,10} = (0.93970) Q_{cond,9} - (0.04664) Q_{cond,8}
+ (0.00099) T_{solair,10} + (0.00836) T_{solair,9}
+ (0.00361) T_{solair,8} + (0.00007) T_{solair,7} - (0.01303) T_i \qquad (9.46)$$

First, values of the driving terms $T_{solair}$ have to be computed
for all the hours over which the computation is to be performed.
To start the calculation, initial guess values are assumed
for $Q_{cond,9}$ and $Q_{cond,8}$. For a specified $T_i$, one can then
calculate $Q_{cond,10}$ and repeat the recursive calculation for each
subsequent hour. The heat gains are periodic because of the
diurnal periodicity of $T_{solair}$. The effect of the initial guess values
soon dies out and the calculations attain the desired accuracy
after a few iterations. ■
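
The recursive calculation described in this example can be sketched as below, using the coefficients of Table 9.7. The function name `conduction_gain`, the number of repeated days and the sinusoidal sol-air profile are hypothetical choices made for illustration; the heat gain is in whatever units the tabulated coefficients imply.

```python
import math

B = [0.00099, 0.00836, 0.00361, 0.00007]   # b_0 ... b_3 from Table 9.7
D = [-0.93970, 0.04664]                     # d_1, d_2 from Table 9.7 (d_0 = 1)
SUM_C = 0.01303                             # sum of the c_n coefficients


def conduction_gain(t_solair, t_indoor, n_days=5):
    """Hourly recursion of Eq. 9.46 over a repeated 24 h sol-air profile.

    Returns the 24 hourly heat gains of the last simulated day. The
    initial guesses for the two lagged heat gains are simply zero.
    """
    q_prev = [0.0, 0.0]                     # guesses for Q_cond at t-1 and t-2
    day = []
    for _ in range(n_days):
        day = []
        for h in range(24):
            # current and lagged sol-air temperatures (periodic profile)
            ts = [t_solair[(h - n) % 24] for n in range(len(B))]
            q = (-D[0] * q_prev[0] - D[1] * q_prev[1]
                 + sum(b * t for b, t in zip(B, ts))
                 - SUM_C * t_indoor)
            q_prev = [q, q_prev[0]]
            day.append(q)
    return day


# Hypothetical diurnal sol-air temperature profile and indoor set point
profile = [30.0 + 10.0 * math.sin(2.0 * math.pi * (h - 9) / 24.0)
           for h in range(24)]
gains = conduction_gain(profile, 24.0)
peak_hour = max(range(24), key=lambda h: gains[h])
print(f"peak conduction gain {gains[peak_hour]:.3f} at hour {peak_hour}")
```

Repeating the daily profile several times is one simple way of letting the influence of the initial guesses die out; comparing the results of two successive days gives a practical convergence check.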



9.7 Quality Control and Process Monitoring 
Using Control Chart Methods 

9.7.1 Background and Approach 

The concept of statistical quality control and quality assurance
was proposed by Shewhart in the 1920s (and, hence,
many of these techniques bear his name) with the intent of
using sampling and statistical analysis techniques to improve
and maintain quality during industrial production. Process
monitoring using control chart techniques provides
an ongoing check on the stability of the process and points
to problems whose elimination can reduce variation and
permanently improve the system (Box and Luceno 1997).
It has been extended to include condition monitoring and
performance degradation of various equipment and systems,
and for control of industrial processes involving process
adjustments using feedback control to compensate for sources
of drift variation. The basic concept is that variation in any
production process is unavoidable, the causes of which can
be categorized into:

(i) Common causes or random fluctuations due to the over- 
all process itself, such as variation in quality of raw ma- 
terials and in consistency of equipment performance — 
these lead to random variability and statistical concepts 
apply; 
(ii) Special or assignable causes (or non-random or spora- 
dic changes) due to specific deterministic circumstances, 
such as operator error, machine fault, faulty sensors, or 
performance degradation of the measurement and cont- 
rol equipment. 
When the variability is due to random or common causes,
the process is said to be in statistical control. The normal
curve is assumed to describe the process measurements with
the confidence limits indicated as the upper and lower control
limits (UCL and LCL) as shown in Fig. 9.26. The practice
of plotting the attributes or characteristics of the process
over time is called monitoring via control charts. A control
chart consists of a horizontal line which locates the process mean
(called "centerline") and two lines (the UCL and LCL limits),
as shown in Fig. 9.27. The intent of statistical process
monitoring using control charts is to detect the occurrence of
non-random events which impact the central tendency and
the variability of the process, and thereby take corrective
action in order to eliminate them. Thus, the process can again
be brought back to stable and statistical control as quickly
as possible.

Fig. 9.26 The upper and lower three-sigma limits indicative of the
UCL and LCL limits shown on a normal distribution

Fig. 9.27 The Shewhart control chart with primary limits

These limits are often drawn to correspond to 3
times the standard deviation so that one can infer that there
is strong evidence that points outside the limits are faulty or
indicate an unstable process. The decision as to whether to
deem an incoming sample of size n at a particular point in
time to be in or out of control is, thus, akin to a two-tailed
hypothesis test where:

Null hypothesis $H_0$: process is in control, $\bar{X} = \mu_0$

Alternative hypothesis $H_a$: process is out of control, $\bar{X} \neq \mu_0$

(9.47)

where $\bar{X}$ denotes the sample mean of n observations and $\mu_0$
the expected mean deduced from an in-control process. Note
that the convention in statistical quality control literature is
to use upper case letters for the mean value. Just as in
hypothesis testing (Sect. 4.2), type I and type II errors can result.
For example, type I error (or false positive error) arises when
a sample (or point on the chart) of an in-control process
falls outside the control bands. As before, the probability
of occurrence is reduced by making appropriate choices of
sample size and control limits.



9.7.2 Shewhart Control Charts for Variables
and Attributes

The Shewhart control chart method is a generic name which
includes a number of different charts. It is primarily meant to
monitor a process, while it is said to be of limited usefulness
for adjusting the process.

(a) Shewhart chart for variables: for continuous measurements
such as diameter, temperature, flow, as well as derived
parameters or quantities such as overall heat loss coefficient,
efficiency, etc.

(i) mean or X-bar chart is used to detect the onset of bias in
measured quantities or estimated parameters. This detection
is based on a two-tailed hypothesis test assuming
a normal error distribution. The control chart plots are
deduced as upper and lower control limits about the
centerline where the norm is to use the 3-sigma confidence
limits for the z-value:

when s is known:

$$\{UCL, LCL\}_{\bar{X}} = \bar{X} \pm 3 \frac{s}{n^{1/2}} \qquad (9.48)$$

where s is the process standard deviation and $(s/n^{1/2})$ is
the standard error of the sample means. Recall that the
3-sigma limits include 99.73% of the area under the normal
curve, and hence, that the probability of a type I error
(or false positive) is only 0.27%. When the standard
deviation of the process is not known, it is suggested
that the average range $\bar{R}$ of the numerous samples be
used for 3-sigma limits as follows:

when s is not known:

$$\{UCL, LCL\}_{\bar{X}} = \bar{X} \pm A_2 \bar{R} \qquad (9.49)$$

where the factor $A_2$ is given in Table 9.8. Note that this
factor decreases as the sample size n increases.
Recall that the range for a given sample is simply the
difference between the highest and lowest values of the
given sample.

Devore and Farnum (2005) cite a study which demonstrated
that the use of medians and the interquartile range
(IQR) was superior to the traditional means and range
control charts. The former were found to be more robust,
i.e., less influenced by spurious outliers. The suggested
control limits were:

$$\{UCL, LCL\}_{\tilde{X}} = \tilde{X} \pm 3 \frac{IQR}{k_n (n)^{1/2}} \qquad (9.50)$$

where $\tilde{X}$ is the median and the values of $k_n$ are selected
based on the sample size n as given in Table 9.9.
(ii) range or R charts are used to monitor the variation, i.e.,
the uniformity or consistency, of a process. The range is a rough
measure of the "rate of change" of the observed variable
which is a more sensitive measure than the mean. Hence,
a point which is out of control on the range chart
may be flagged as an abnormality before the mean chart
does. Consider the case of drawing k samples each with
sample size n (i.e., each sample consists of drawing n
items or taking n individual measurements)^. The 3-sigma
limits for the range chart are given by:

mean line: $\bar{R}$

lower control limit: $LCL_R = D_3 \bar{R}$

upper control limit: $UCL_R = D_4 \bar{R}$

(9.51)

where $\bar{R}$ is the mean of the ranges of the k samples, and
the numerical values of the coefficients $D_3$ and $D_4$ are
given in Table 9.8 for different sample sizes.

Table 9.8 Numerical values of the three coefficients to be used in
Eqs. 9.49 and 9.51 for constructing the three-sigma limits for the mean
and range charts. (Source: Adapted from "1950 ASTM Manual on Quality
Control of Materials," American Society for Testing and Materials,
in J. M. Juran, ed., Quality Control Handbook (New York: McGraw-Hill
Book Company, 1974), Appendix H, p. 39.)

Number of         Factor for control    Factors for control limits,
observations in   limits, chart for     chart for the range
each sample, n    the mean, A_2         D_3        D_4
 2                1.880                 0          3.268
 3                1.023                 0          2.574
 4                0.729                 0          2.282
 5                0.577                 0          2.114
 6                0.483                 0          2.004
 7                0.419                 0.076      1.924
 8                0.373                 0.136      1.864
 9                0.337                 0.184      1.816
10                0.308                 0.223      1.777
11                0.285                 0.256      1.744
12                0.266                 0.284      1.717
13                0.249                 0.308      1.692
14                0.235                 0.329      1.671
15                0.223                 0.348      1.652

Table 9.9 Values of the factor k_n to be used in Eq. 9.50

n      4       5       6       7       8
k_n    0.596   0.990   1.282   1.512   0.942


It is suggested that the mean and range charts be used together
since their complementary properties allow better
monitoring of a process. Figure 9.28 illustrates two instances
where using both charts reveals behavior which
one chart alone would have missed.

There are several variants of the above mean and range
charts since several statistical indices are available to measure
the central tendency and the variability. One common
chart is the standard deviation or s chart, while other types
of charts involve s² charts. Which chart to use depends to
some extent on personal preference. Often, the sample size
for control chart monitoring is low (around 5 according to
Himmelblau 1978), which results in the standard deviation
not being a very robust statistic. One suggestion is to use the s
charts when the sample size is around 8-10 or larger, and to
use range charts for smaller sample sizes. Further, the range
is easier to visualize and interpret, and is more easily
determined than the standard deviation.

^ Note the distinction between the number of samples (k) and the
sample size (n).

Fig. 9.28 The combined advantage provided by the mean and range
charts in detecting out-of-control processes. Two instances are shown:
a where the variability is within limits but the mean is out of control
which is detected by the mean chart, and b where the mean is in control
but not the variability which is detected by the range chart


Example 9.7.1: Illustration of the mean and range charts 
Consider a process where 20 samples, each consisting of 4 
items, are gathered as shown in Table 9.10. The mean and 
range charts will be used to illustrate how to assess whether 
the process is in control or not. 

The X-bar and R charts are shown in Fig. 9.29. Note that 
no point is beyond the control limits in either plot indicating 
that the process is in statistical control. 

The data in Table 9.10 has been intentionally corrupted
such that the four items of one sample only have a higher
mean value (about 1.6). How the mean and range plots flag
this occurrence is shown in Fig. 9.30. The process fault is
detected by the X-bar chart but not by the R chart. Therefore,
it is recommended that the
process be monitored using both the mean and range charts,
which can provide additional insights not provided by each
control chart technique alone. ■
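
The limits used in this example can be reproduced with the short Python sketch below, which applies Eqs. 9.49 and 9.51 to the sample means and ranges listed in Table 9.10. The function name `xbar_r_limits` is hypothetical, and the factor D_3 for n = 4 is taken as zero, which is the standard tabulated value for small sample sizes.

```python
# Sample means and ranges of the 20 samples listed in Table 9.10
XBAR = [1.400, 1.394, 1.392, 1.402, 1.390, 1.404, 1.399, 1.395, 1.400, 1.407,
        1.399, 1.399, 1.396, 1.393, 1.400, 1.402, 1.404, 1.396, 1.399, 1.395]
RANGE = [0.042, 0.030, 0.014, 0.033, 0.015, 0.004, 0.023, 0.028, 0.023, 0.022,
         0.015, 0.020, 0.018, 0.026, 0.028, 0.023, 0.023, 0.021, 0.024, 0.014]

# Factors for sample size n = 4 (A_2 and D_4 from Table 9.8; D_3 assumed 0)
A2, D3, D4 = 0.729, 0.0, 2.282


def xbar_r_limits(xbar, ranges, a2, d3, d4):
    """Return (LCL, centerline, UCL) for the mean and the range charts."""
    grand_mean = sum(xbar) / len(xbar)
    r_bar = sum(ranges) / len(ranges)
    return {
        "mean": (grand_mean - a2 * r_bar, grand_mean, grand_mean + a2 * r_bar),
        "range": (d3 * r_bar, r_bar, d4 * r_bar),
    }


def out_of_control(values, limits):
    """1-based sample numbers whose statistic falls outside the limits."""
    lcl, _, ucl = limits
    return [i + 1 for i, v in enumerate(values) if v < lcl or v > ucl]


limits = xbar_r_limits(XBAR, RANGE, A2, D3, D4)
flag_mean = out_of_control(XBAR, limits["mean"])
flag_range = out_of_control(RANGE, limits["range"])

for name in ("mean", "range"):
    lcl, center, ucl = limits[name]
    print(f"{name:5s} chart: LCL = {lcl:.4f}, CL = {center:.4f}, UCL = {ucl:.4f}")
print("flagged on mean chart :", flag_mean)
print("flagged on range chart:", flag_range)
```

Appending the mean and range of a new sample to the two lists (while keeping the limits fixed) is how an incoming sample would be screened against an already established in-control baseline.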

Table 9.10 Data table for the 20 samples consisting of four items and
associated mean and range statistics (Example 9.7.1)

Sample   Item 1   Item 2   Item 3   Item 4   X-bar   Range (R)
 1       1.405    1.419    1.377    1.400    1.400   0.042
 2       1.407    1.397    1.377    1.393    1.394   0.030
 3       1.385    1.392    1.399    1.392    1.392   0.014
 4       1.386    1.419    1.387    1.417    1.402   0.033
 5       1.382    1.391    1.390    1.397    1.390   0.015
 6       1.404    1.406    1.404    1.402    1.404   0.004
 7       1.409    1.386    1.399    1.403    1.399   0.023
 8       1.399    1.382    1.389    1.410    1.395   0.028
 9       1.408    1.411    1.394    1.388    1.400   0.023
10       1.399    1.421    1.400    1.407    1.407   0.022
11       1.394    1.397    1.396    1.409    1.399   0.015
12       1.409    1.389    1.398    1.399    1.399   0.020
13       1.405    1.387    1.399    1.393    1.396   0.018
14       1.390    1.410    1.388    1.384    1.393   0.026
15       1.393    1.403    1.387    1.415    1.400   0.028
16       1.413    1.390    1.395    1.411    1.402   0.023
17       1.410    1.415    1.392    1.397    1.404   0.023
18       1.407    1.386    1.396    1.393    1.396   0.021
19       1.411    1.406    1.392    1.387    1.399   0.024
20       1.404    1.396    1.391    1.390    1.395   0.014
Grand mean                                   1.398   0.022

(b) Shewhart control charts for attributes: In complex assembly
operations (or during condition monitoring involving
several sensors as in many thermal systems), numerous quality
variables would need to be monitored, and in principle,
each one could be monitored separately. A simpler procedure
is to inspect n finished products, denoting a sample, at regular
intervals and to simply flag the proportion of products
in the sample found to be defective or non-defective. Thus,
analysis using attributes would only differentiate between
two possibilities: acceptable or not acceptable. Different types
of charts have been proposed; two important ones are
listed below:

(a) p-chart for fraction or proportion of defective items
in a sample (it is recommended that typically n=100
or so). An analogous chart for tracking the number of
defectives in a sample, i.e., the variable (n.p), is also
widely used;

(b) c-chart for rate of defects or minor flaws or number of
nonconformities per unit time. This is a more sophisticated
type of chart where an item may not be defective
so as to render it useless, but would nevertheless compromise
the quality of the product. An item can have
non-conformities but still be able to function as intended.
It is based on the Poisson distribution rather than
the Binomial distribution which is the basis for method
(a) above. The reader can refer to texts such as Devore
and Farnum (2005) or Walpole et al. (2007) for a more
detailed description of this approach.

Method (a) is briefly described below. Let p be the probability
that any particular item is defective. One manner
of determining probability p is to infer it as the long run
proportion of defective items taken from a previous in-control
period. If the process is assumed to be independent
between samples, then the expected value and the variance
of the sample fraction of defectives $\hat{p}$ in a random sample
of size n, with p being the probability of a defective, are
given by (see Sect. 2.4.2):

$$E(\hat{p}) = p, \qquad var(\hat{p}) = \frac{p(1-p)}{n} \qquad (9.52)$$

Thus, the 3-sigma upper and lower limits are given by:

$$\{UCL, LCL\}_p = p \pm 3 \left[ \frac{p(1-p)}{n} \right]^{1/2} \qquad (9.53)$$

In case the LCL is negative, it has to be set to zero, since
negative values are physically impossible.



Fig. 9.29 Shewhart charts for mean and range using data from
Table 9.10. a X-bar chart. b Range chart

Fig. 9.30 Shewhart charts when one of the samples in Table 9.10
has been intentionally corrupted. a X-bar chart. b Range chart
Table 9.11 Data table for Example 9.7.2

Sample   Number of defective components   Fraction defective, p
 1       8                                0.16
 2       6                                0.12
 3       5                                0.10
 4       7                                0.14
 5       2                                0.04
 6       5                                0.10
 7       3                                0.06
 8       8                                0.16
 9       4                                0.08
10       4                                0.08
11       3                                0.06
12       1                                0.02
13       5                                0.10
14       4                                0.08
15       4                                0.08
16       2                                0.04
17       3                                0.06
18       5                                0.10
19       6                                0.12
20       3                                0.06
Mean                                      0.088




Example 9.7.2: Illustration of the p-chart method
Consider the data shown in Table 9.11, collected from a process
where 20 samples are gathered, each of size n=50. If prior
knowledge is available as to the expected defective proportion p,
then that value should be used. If it is not, and provided the
process is generally fault-free, it can be computed from the data
itself as shown. From the table, the mean p value = 0.088. This is
used as the baseline for comparison in this example.

The centerline (CTR) and the UCL and LCL values for the p chart
following Eqs. 9.52 and 9.53 are shown in Fig. 9.31. The process
can be taken to be in control since the individual points are
contained within the UCL and LCL bands. Since the p value cannot
be negative, the LCL is forced to zero; this is the reason for the
asymmetry in the UCL and LCL bands around the CTR. The analogous
chart for the number of defectives is also shown. Note that the
two types of charts look very similar except for the numerical
values of the UCL and LCL; this is not surprising since the sample
size n (taken as 50 in this example) is a constant multiplier. ■
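The limits used in this example can be checked numerically. The following is a minimal sketch, assuming the defective counts of Table 9.11 and the three-sigma limits of Eqs. 9.52 and 9.53; the variable names are illustrative only.

```python
import math

# Number of defective components in each of the 20 samples (Table 9.11)
defectives = [8, 6, 5, 7, 2, 5, 3, 8, 4, 4, 3, 1, 5, 4, 4, 2, 3, 5, 6, 3]
n = 50  # items per sample

fractions = [d / n for d in defectives]
p_bar = sum(fractions) / len(fractions)          # centerline

sigma_p = math.sqrt(p_bar * (1.0 - p_bar) / n)   # square root of var(p), Eq. 9.52
ucl = p_bar + 3.0 * sigma_p                      # Eq. 9.53
lcl = max(0.0, p_bar - 3.0 * sigma_p)            # a negative limit is set to zero

in_control = all(lcl <= p <= ucl for p in fractions)
print(f"p_bar={p_bar:.3f}, UCL={ucl:.3f}, LCL={lcl:.3f}, in control: {in_control}")
```

Multiplying the centerline and both limits by n gives the corresponding limits for the chart of the number of defectives.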



(c) Practical implementation issues The basic process for
constructing control charts is to first gather at least k=25-30
samples of data, each with a fixed number n of objects or
observations, from a production process known to be working
properly, i.e., one in statistical control. A typical value of n
for X-bar and R charts is n=5. As the value of n is increased, one
can detect smaller changes but at the expense of more time and
money.

The mean, UCL and LCL values could be preset and unchanging during
the course of operation, or they could be estimated anew at each
updating period. Say the analysis is done once a day, with four
observations (n=4) taken hourly, and the process operates
24 h/day. The limits for each day could be updated based on the
statistics of the 24 samples taken the previous day or kept fixed
at some pre-set value. Such choices are best made based on
physical insights into the specific process or equipment being
monitored. A practical consideration is that process operators do
not like frequent adjustments made to the control limits. Not only
can this lead to errors in resetting the limits, but it may also
lead to psychological skepticism about the reliability of the
entire statistical control approach.

When a process is in control, the points from each sample plotted
on the control chart should fluctuate in a random manner between
the UCL and the LCL with no clear pattern. Several "rules" have
been proposed to increase the sensitivity of Shewhart charts.
Other than "no points outside the control limits", one could check
whether: (i) the numbers of points above and below the centerline
are about equal, (ii) there is no steady rise or decrease in a
sequence of points, (iii) most of the points are close to the
centerline rather than hugging the limits, (iv) there is no sudden
shift in the process mean, and (v) there is no cyclic behavior.

Devore and Farnum (2005) present an extended list of
"out-of-control" rules involving counting the number of points
falling within different bounds corresponding to one, two and
three sigma lines. Eight different types of out-of-control
behavior patterns are shown to illustrate that several possible
schemes can be devised for process monitoring. Others have
developed similar types of rules in order to increase the
sensitivity of the monitoring process. However, using such types
of extended rules also increases the possibility of false alarms
(or type I errors), and so, rather than being ad hoc, there should
be some statistical basis to these rules.
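Two of the informal checks mentioned above (balance of points about the centerline, and long runs on one side of it) are easy to automate. The sketch below is illustrative only: the function name `run_checks` is hypothetical, the data are the defective counts of Table 9.11, and no alarm threshold on the run length is implied since the text does not specify one.

```python
from itertools import groupby

def run_checks(values, centerline):
    """Count points on each side of the centerline and find the longest same-side run."""
    # Points lying exactly on the centerline are ignored
    sides = [v > centerline for v in values if v != centerline]
    above = sum(sides)
    below = len(sides) - above
    longest_run = max((len(list(g)) for _, g in groupby(sides)), default=0)
    return above, below, longest_run

# Defective counts from Table 9.11; the centerline is their mean (n * p_bar = 4.4)
counts = [8, 6, 5, 7, 2, 5, 3, 8, 4, 4, 3, 1, 5, 4, 4, 2, 3, 5, 6, 3]
above, below, longest_run = run_checks(counts, sum(counts) / len(counts))
print(above, below, longest_run)
```

Whether a given run length should trigger an alarm depends on the false-alarm rate one is willing to accept, which is the statistical basis called for above.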



Fig. 9.31 Shewhart p-charts. a Chart for the proportion of
defectives. b Chart for the number of defectives (n.p)

9.7.3 Statistical Process Control Using Time Weighted Charts

Traditional Shewhart chart methods are based on investigating
statistics (mean or range, for example) of an individual sample
of n items. Time weighted procedures allow more sensitive
detection by basing the inferences not on the statistics of one
individual sample, but on the cumulative sum or the moving sum of
a series of successive observations. This is somewhat similar to
the rather heuristic "out-of-control" rules stated by Devore and
Farnum (2005), but now this is done in a more statistically sound
manner. Typical time weighted approaches are those using Cusum
and moving average methods since they have a shorter average run
length in detecting small to moderate process shifts. In short,
they incorporate the past history of the process and are more
sensitive to small gradual changes. However, non-normality and
serial correlation in the data have an important effect on
conclusions drawn from Cusum plots (Himmelblau 1978).

9.7.3.1 Cusum Charts 

Cusum or cumulative sum charts are similar to the Shewhart charts
in that they are diagnostic tools which indicate whether a process
has gone out of control or not due to the onset of special
non-random causes. However, the cusum approach makes the inference
based on a sum of deviations rather than on individual samples.
Cusum charts damp down random noise while amplifying true process
changes. They can indicate when and by how much the mean of the
process has shifted. Consider a control chart for the mean with a
reference or target level established at $\mu_0$. Let the sample
means be given by $(\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_r)$.
Then, the first r cusums are computed as:

$$
\begin{aligned}
S_1 &= \bar{X}_1 - \mu_0 \\
S_2 &= S_1 + (\bar{X}_2 - \mu_0) = (\bar{X}_1 - \mu_0) + (\bar{X}_2 - \mu_0) \\
&\;\;\vdots \\
S_r &= S_{r-1} + (\bar{X}_r - \mu_0) = \sum_{i=1}^{r} (\bar{X}_i - \mu_0)
\end{aligned}
\tag{9.54}
$$
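The recursion of Eq. 9.54 amounts to a running sum of deviations from the target. A minimal sketch follows, assuming NumPy is available; the sample means used are hypothetical values chosen for illustration and are not the Table 9.10 data.

```python
import numpy as np

def cusum(sample_means, target):
    """Return S_1..S_r, the running sum of deviations of the sample means from the target."""
    deviations = np.asarray(sample_means, dtype=float) - target
    return np.cumsum(deviations)

# Hypothetical sample means, for illustration only
x_bar = [10.2, 9.8, 10.1, 9.9, 10.6, 10.7, 10.5]
s = cusum(x_bar, target=10.0)
print(np.round(s, 2))
```

In this made-up sequence the cusum hovers near zero for the first four samples and then climbs steadily, which is the kind of sustained slope that signals a shift in the mean.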



The above discussion applied to the mean residuals, i.e., the
difference between the measured value and its expected value.
Such charts could be based on other statistics such as the range,
the variable itself, absolute differences, or successive
differences between observations.

The cusum chart is simply a plot of $S_r$ over time, like the
Shewhart charts, but it provides a different type of visual
record. Since the deviations add up, i.e., cumulate, an increase
(or decrease) in the process mean will result in an upward (or
downward) slope of the value $S_r$. The magnitude of the slope is
indicative of the size of the change in the mean. Special
templates or overlays are generated according to certain specific
rules for constructing them (see for example, Walpole et al.
2007). Often, these overlays are shaped as a Vee (since the slope
is indicative of a change in the mean of the process) which is
placed over the most recent point. If the data points fall within
the opening of the Vee, then the process is considered to be in
control, otherwise it is not. If there is no shift in the mean,
the cusum chart should fluctuate around the horizontal line. Even
a moderate change in the mean, however, would result in the cusum
chart exhibiting a slope, with each new observation highlighting
the slope more distinctly. Cusum charts are drawn with
pre-established limits set by the user which apply to the mean:
AQL acceptable quality level, i.e., when the process is
in-control, $\mu = \mu_0$
RQL rejectable quality level, i.e., when the process is
out-of-control, $\mu \neq \mu_0$
Clearly, these limits are similar to the concepts of null and
alternate mean values during hypothesis testing. The practical
advantages and disadvantages of this approach are discussed by
Himmelblau (1978). The following example should clarify this
approach.

Example 9.7.3: Illustration of Cusum plots
The data shown in Table 9.10 represents a process known to be in
control. Two tests will be performed:

(a) Use the Cusum approach to verify that indeed this data is in
control

(b) Corrupt only one of the 20 sample data as done in Example
9.7.1 (i.e., the numerical values of four items forming the
sample) and illustrate how the Cusum chart behaves under this
situation.

Figure 9.32a shows the Cusum plot with the Vee mask. Since no
point is outside the opening, the process is deemed to be in
control. However, when the corruption of one data point (case b
above) is introduced, one notes from Fig. 9.32b that the Cusum
chart signals this effect quite dramatically since several points
are outside the opening bounded by the Vee mask. ■



9.7.3.2 EWMA Monitoring Process 

Moving average control charts can provide greater sensitivity in
process monitoring and control since information from past
samples is combined with that of the current sample. There is,
however, the danger that an incipient trend which gradually
appears in the past observations may submerge any small shifts in
the process. The exponential weighted moving average (EWMA)
process (discussed in Sect. 9.3.2), which has direct links to the
AR(1) model (see Sect. 9.5.3), has redeeming qualities which make
it attractive as a statistical quality control tool. One can apply
this monitoring approach either to the sample mean of a set of
observations forming a sample, or to individual observations
taken from a system while in operation.
Fig. 9.32 a The Cusum chart with data from Table 9.10.
b The Cusum chart with one data sample intentionally corrupted
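The EWMA recursion used in this section, $\tilde{Y}_t = \lambda Y_t + (1-\lambda)\tilde{Y}_{t-1}$ seeded with the target value, can be sketched in a few lines. This is only an illustration: the function name `ewma` is hypothetical, and the inputs are the starting value of 10, the weight $\lambda = 0.4$ and the first three observations (6, 9, 12) that appear in the worked values of this section.

```python
import math

def ewma(observations, lam, start):
    """Apply Y~_t = lam * Y_t + (1 - lam) * Y~_{t-1}, seeded with the target value."""
    smoothed, previous = [], start
    for y in observations:
        previous = lam * y + (1.0 - lam) * previous
        smoothed.append(previous)
    return smoothed

lam = 0.4
values = ewma([6, 9, 12], lam, start=10.0)    # first three observations used in the text
sigma_ratio = math.sqrt(lam / (2.0 - lam))    # sigma_EWMA / sigma_Y for an in-control process
print([round(v, 2) for v in values], round(sigma_ratio, 2))
```

Setting `lam` to 1 returns the observations unchanged, which corresponds to the Shewhart chart as the limiting case.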
An exponential weighted average with a discount factor $\theta$
such that $-1 < \theta < +1$ would be (Box and Luceno 1997):

$$\tilde{Y}_t = (1 - \theta)\,(Y_t + \theta Y_{t-1} + \theta^2 Y_{t-2} + \cdots) \tag{9.55}$$

where the constant $(1 - \theta)$ is introduced in order to
normalize the sum of the series to unity since

$$(1 + \theta + \theta^2 + \cdots) = (1 - \theta)^{-1}$$

Instead of using Eq. 9.55 to repeatedly recalculate $\tilde{Y}_t$
with each fresh observation, a convenient updating formula is:

$$\tilde{Y}_t = \lambda Y_t + \theta\,\tilde{Y}_{t-1} \tag{9.56}$$

where the new variable $\lambda$ (seemingly redundant) is
introduced by convention such that $\lambda = 1 - \theta$.

If $\lambda = 1$, all the weight is placed on the latest
observation, and one gets the Shewhart chart.

Consider the following sequence of observations from an operating
system (Box and Luceno 1997):

Observation Y:   10   6   9   12   11   ...

Note that the starting value of 10 is taken to be the target
value. If $\lambda = 0.4$, then:

$\tilde{Y}_1$ = (0.4 x 6) + (0.6 x 10) = 8.4
$\tilde{Y}_2$ = (0.4 x 9) + (0.6 x 8.4) = 8.64
$\tilde{Y}_3$ = (0.4 x 12) + (0.6 x 8.64) = 9.98

and so on.

If the process is in a perfect state of control and any deviations
can be taken as a random sequence with standard deviation
$\sigma_Y$, it can be shown that the associated standard deviation
of the EWMA process is:

$$\sigma_{\tilde{Y}} = \sigma_Y \left(\frac{\lambda}{2 - \lambda}\right)^{1/2} \tag{9.57}$$

Thus, if one assumes $\lambda = 0.4$, then
$(\sigma_{\tilde{Y}}/\sigma_Y) = 0.5$. The benefits of both the
traditional Shewhart charts and the EWMA charts can be combined by
generating a co-plot (in the above case, the three-sigma bands for
EWMA will be half the width of the three-sigma Shewhart mean
bands). Both sets of metrics for each observation can be plotted
on such a co-plot for easier visual tracking of the process. An
excellent discussion on EWMA and its advantage in terms of process
adjustment using feedback control is provided by Box and Luceno
(1997).

9.7.4 Concluding Remarks

There are several other related analysis methods which have been
described in the literature. To name a few (Devore and Farnum
2005):

(a) Process capability analysis: This analysis provides a means to
quantify the ability of a process to meet specifications or
requirements. Just because a process is in control does not mean
that specified quality characteristics are being met. Process
capability analysis compares the distribution of process output
to specifications when only common causes determine the
variation. Should any special causes be present, this entire line
of enquiry is invalid, and so one needs to carefully screen data
for special effects before undertaking this analysis. Process
capability is measured by the proportion of output that can be
produced within design specifications. By collecting data,
constructing frequency distributions and histograms, and
computing basic descriptive statistics (such as mean and
variance), the nature of the process can be better understood.

(b) Pareto analysis for quality assessment: Pareto analysis is a
statistical procedure that seeks to discover from an analysis of
defect reports or customer complaints which "vital few" causes
are responsible for most of the reported problems. The old adage
states that 80% of reported problems can usually be traced to 20%
of the various underlying causes. By concentrating one's efforts
on rectifying the vital 20%, one can have the greatest immediate
impact on product quality. It is used with attribute data based on
histograms/frequency of each type of fault, and reveals the most
frequent defect.

There are several instances when certain products and processes
can be analyzed with more than one method, and there is no
clear-cut choice. X-bar and R charts are quite robust in that they
yield good results even if the data is not normally distributed,
while Cusum charts are adversely affected by serial correlation in
the data. Table 9.12 provides useful practical tips as to the
effectiveness of different control chart techniques under
different situations.

Table 9.12 Relative effectiveness of control charts in detecting a
change in a process. (From Himmelblau 1978)

                             Control chart
Cause of change              Mean     Range   Standard        Cumulative
                             (X-bar)  (R)     deviation (s)   sum (CS)
Gross error (blunder)        1        2       -               3
Shift in mean                2        -       3               1
Shift in variability         -        1       -               -
Slow fluctuation (trend)     2        -       -               1
Rapid fluctuation (cycle)    -        1       2               -

1 = most useful, 2 = next best, 3 = least useful, and - = not
appropriate



Problems 

Pr. 9.1 Consider the time series data given in Table 9.1 which
was used to illustrate various concepts throughout this chapter.
Example 9.4.2 revealed that the model was still not satisfactory
since the residuals still have a distinct trend. You will
investigate alternative models (such as a second order linear or
an exponential) in an effort to improve the residual behavior.

Perform the same types of analyses as illustrated in the 
text. This involves determining whether the model is more 
accurate when fit to the first 48 data points, whether the re- 
siduals show less pronounced patterns, and whether the fo- 
recasts of the four quarters for 1986 have become more ac- 
curate. Document your findings in a succinct manner along 
with your conclusions. 

Pr. 9.2 Use the following time series models for forecasting
purposes:

(a) $Z_t = 20 + \varepsilon_t + 0.45\,\varepsilon_{t-1} - 0.35\,\varepsilon_{t-2}$.
Given the latest four observations {17.50, 21.36, 18.24, 16.91},
compute forecasts for the next two periods

(b) $Z_t = 15 + 0.86\,Z_{t-1} - 0.32\,Z_{t-2} + \varepsilon_t$.
Given the latest two values of Z {32, 30}, determine the next four
forecasts.



Pr. 9.3 Section 9.5.3 describes the manner in which various types
of ARMA series can be synthetically generated as shown in
Figs. 9.21, 9.22 and 9.23, and how one could verify different
recommendations on model identification. These are useful aids for
acquiring insights and confidence in the use of ARMA. You are
asked to synthetically generate 50 data points using the following
models and then use these data sequences to re-identify the models
(because of the addition of random noise, there will be some
differences in the model parameters identified):

(a) $Z_t = 5 + \varepsilon_t + 0.7\,\varepsilon_{t-1}$ with N(0, 0.5)

(b) $Z_t = 5 + \varepsilon_t + 0.7\,\varepsilon_{t-1}$ with N(0, 1)

(c) $Z_t = 20 + 0.6\,Z_{t-1} + \varepsilon_t$ with N(0, 1)

(d) $Z_t = 20 + 0.8\,Z_{t-1} - 0.2\,Z_{t-2} + \varepsilon_t$ with N(0, 1)

(e) $Z_t = 20 + 0.8\,Z_{t-1} + \varepsilon_t + 0.7\,\varepsilon_{t-1}$ with N(0, 1)

Pr. 9.4 Time series analysis of sun spot frequency per year 
from 1770-1869 

Data assembled in Table B.5 (in Appendix B) represents 
the so-called Wolf number of sunspots per year (n) over 
many years (from Montgomery and Johnson 1976 by per- 
mission of McGraw-Hill). 

(a) First plot the data and visually note underlying patterns 

(b) You will develop at least 2 alternative models using data 
from years 1770-1859. The models should include dif- 
ferent trend and/or seasonal OLS models, as well as sub- 
classes of the ARIMA models (where the trends have 
been removed by OLS models or by differencing). Note 
that you will have to compute the ACF and PACF for 
model identification purposes 

(c) Evaluate these models using the expost approach where 
the data for years 1860-1869 are assumed to be known 
with certainty (as done in Example 9.6.1). 

Pr. 9.5 Time series of yearly atmospheric CO2 concentrations from
1979-2005

Table B.6 (refer to Appendix B) assembles data of yearly
carbon-dioxide (CO2) concentrations (in ppm) in the atmosphere and
the temperature difference with respect to a base year (in °C)
(from Andrews and Jelley 2007 by permission of Oxford University
Press).

(a) Plot the data both as time series as well as scatter plots and
look for underlying trends

(b) Using data from years 1979-1999, develop at least two models
for the Temp. difference variable. These could be trend and/or
seasonal or ARIMA type models

(c) Repeat step (b) but for the CO2 concentration variable

(d) Using the same data from 1979-1999, develop a model for CO2
where Temp. difference is one of the regressor variables

(e) Evaluate the models developed in (c) and (d) using data from
2000-2005 assumed known (this is the expost conditional case)
Fig. 9.33 Monthly mean global CO2 concentration for the period
2002-2007. The smoothed line is a moving average over 10 adjacent
months. (Downloaded from NOAA website
http://www.cmdl.noaa.gov/ccgg/trends/index.php#mlo, 2006)
(f) Compare the above results for the exante unconditional
situation. In this case, future values of temperature difference
are not known, and so the model developed in step (b) will be used
to first predict this variable, which will then be used as an
input to the model developed in step (d)

(g) Using the final model, forecast the CO2 concentration for 2006
along with 95% CL.

Pr. 9.6 Time series of monthly atmospheric CO2 concentrations from
2002-2006

Figure 9.33 represents global CO2 levels but at monthly levels.
Clearly there is both a long-term trend and a cyclic seasonal
variation. The corresponding data is shown in Table B.7 (and can
be found in Appendix B). You will use the first four years of data
(2002-2005) to identify different moving average smoothing
techniques, trend+seasonal OLS models, as well as ARIMA models as
illustrated through several examples in the text. Subsequently,
evaluate these models in terms of how well they predict the
monthly values of the last year (i.e., year 2006).

Pr. 9.7 Transfer function analysis of unsteady state heat transfer
through a wall

You will use the conduction transfer function coefficients given
in Example 9.6.1 to calculate the hourly heat gains ($Q_{cond}$)
through the wall for a constant room temperature of 24°C and the
hourly solar-air temperatures for a day given in Table 9.13
(adapted from Kreider et al. 2009). You will assume guess values
to start the calculation, and repeat the diurnal calculation over
as many days as needed to achieve convergence, assuming the same
solar-air temperature values for successive days. This problem is
conveniently solved on a spreadsheet.

Pr. 9.8 Transfer function analysis using simulated hourly loads in
a commercial building

The hourly loads (total electrical, thermal cooling and thermal
heating) for a large hotel in Chicago, IL have been generated for
three days in August using a detailed building energy simulation
program. The data shown in Table B.8 (given in Appendix B)
consists of outdoor dry-bulb ($T_{db}$) and wet-bulb ($T_{wb}$)
temperatures in °F as well as the internal electric loads of the
building $Q_{int}$ (these are the three regressor variables). The
response variables are the total building electric power use (kWh)
and the cooling and heating thermal loads (Btu/h).

(a) Plot the various variables as time series plots and note
underlying patterns.

(b) Use OLS to identify a trend and seasonal model using indicator
variables but with no lagged terms for Total Building electric
power.

(c) For the same response variable, evaluate whether the seasonal
differencing approach, i.e., $\nabla_{24} Y_t = Y_t - Y_{t-24}$, is
as good as the trend and seasonal model in detrending the data
series.

(d) Identify ARMAX models for all three response variables
separately by using two days for model identification and the last
day for model evaluation
Table 9.13 Data table for Problem 9.7

Hour     Solar-air    Hour     Solar-air    Hour     Solar-air
ending   temp (°C)    ending   temp (°C)    ending   temp (°C)
1        24.4         9        32.7         17       72.2
2        24.4         10       35.0         18       58.8
3        23.8         11       37.7         19       30.5
4        23.3         12       40.0         20       29.4
5        23.3         13       53.3         21       28.3
6        25.0         14       64.4         22       27.2
7        27.7         15       72.7         23       26.1
8        30.0         16       75.5         24       25.0



(e) Report all pertinent statistics and compare the results of 
different models. Provide reasons as to why the parti- 
cular model was selected as the best one for each of the 
three response variables. 

Pr. 9.9 Example 9.7.1 illustrated the use of Shewhart charts for
variables.

(a) Reproduce the analysis results in order to gain confidence

(b) Repeat the analysis but using the Cusum and EWMA (with
$\lambda = 0.4$) and compare results.

Pr. 9.10 Example 9.7.2 illustrated the use of Shewhart charts for
attribute variables.

(a) Reproduce the analysis results in order to gain confidence

(b) Repeat the analysis but using the Cusum and EWMA (with
$\lambda = 0.4$) and compare results.
10 Parameter Estimation Methods



This chapter covers topics related to estimation of model
parameters and to model identification of univariate and
multivariate problems not covered in earlier chapters. First, the
fundamental notion of estimability of a model is introduced, which
includes both structural and numerical identifiability concepts.
Recall that in Chap. 5 the issues addressed were relevant to
parameter estimation of linear models and to variable selection
using step-wise regression in multivariate analysis. This chapter
extends these basic notions by presenting the general statistical
parameter estimation problem, and then presenting a few important
estimation methods. Multivariate estimation methods (such as
principal component analysis, ridge regression and stagewise
regression) are discussed along with case study examples. Next,
the error in variable (EIV) situation is treated, which arises
when the errors in the regressor variables are large.
Subsequently, another powerful and widely used estimation method,
namely maximum likelihood estimation (MLE), is described, and its
application to parameter estimation of probability functions and
logistic models is presented. Also covered is parameter estimation
of models non-linear in the parameters, which can be separated
into those that are transformable into linear ones and those
which are intrinsically non-linear. Finally, computer intensive
numerical methods are discussed since such methods are being
increasingly used nowadays because of the flexibility and
robustness they provide. Different robust regression methods,
whereby the influence of outliers on parameter estimation can be
deemphasized, are discussed. This is followed by the bootstrap
resampling approach which is applicable for parameter estimation,
and for ascertaining confidence limits of estimated model
parameters and of model predictions.



10.1 Background

Statistical indices and residual analysis meant to evaluate the
suitability of linear models, ways of identifying parsimonious
models by step-wise regression, and OLS parameter estimation were
addressed in Chap. 5. Basic to proper parameter estimation is the
need to assess whether the data at hand is sufficiently rich and
robust for the intended purpose; this aspect was addressed under
design of experiments in Chap. 6. There are instances where OLS
techniques, though most widely used, may not be suitable for
parameter estimation. One such case occurs when regressors are
correlated in a multiple linear regression (MLR) problem (recall
that in Sect. 5.7.4, it was stated that step-wise regression was
not recommended in such a case). Several techniques have been
proposed to deal with such a situation, while allowing one to
understand the underlying structure of the multivariate data and
reduce the dimensionality of the problem. A basic exposure to
these important statistical concepts is of some importance.

Also, parameter estimation of non-linear models relies on search
methods similar to the ones discussed in Sect. 7.3 under
optimization methods. In fact, the closed-form solutions for
identifying the "best" OLS parameters are directly derived by
minimizing an objective or loss function framed as the sum of
square errors subject to certain inherent conditions (see
Eq. 5.3). Such analytical solutions are no longer possible for
non-linear models or for certain situations (such as when the
measurement errors in the regressors are large). It is in such
situations that estimation methods such as the maximum likelihood
method or the different types of computer intensive methods
discussed in this chapter (which offer great flexibility and
robustness at the expense of computing resources) are
appropriate, and have, thereby, gained popularity.

10.2 Concept of Estimability


The concept of estimability is an important one and relates to the
ill-conditioning of the parameter matrix. It consists of two
separate issues: structural identifiability and numerical
identifiability, both of which are separately discussed below.
Both these issues closely parallel the mathematical concepts of
ill-conditioning of functions and uniqueness of equations which
are covered in numerical analysis textbooks (for example, Chapra
and Canale 1988), and so these are also reviewed.



10.2.1 Ill-Conditioning 

The condition of a mathematical problem relates to its sensitivity
to changes in the data. A computation is numerically unstable with
respect to round-off and truncation errors if these uncertainties
are grossly magnified by the numerical method. Consider the first
order Taylor series:

$$f(x) = f(x_0) + f'(x_0)\,(x - x_0) \tag{10.1}$$

The relative error of f(x) can be defined as:

$$\varepsilon[f(x)] = \frac{f(x) - f(x_0)}{f(x_0)} = \frac{f'(x_0)\,(x - x_0)}{f(x_0)} \tag{10.2}$$

The relative error of x is given by

$$\varepsilon(x) = \frac{x - x_0}{x_0} \tag{10.3}$$

The condition number is the ratio of these relative errors:

$$C_d = \frac{\varepsilon[f(x)]}{\varepsilon(x)} = \frac{x_0\,f'(x_0)}{f(x_0)} \tag{10.4}$$

The condition number provides a measure of the extent to which an
uncertainty in x is magnified by f(x). A value of 1 indicates that
the function's relative error is identical to the relative error
in x. Functions with very large condition numbers are said to be
ill-conditioned.



Example 10.2.1: Calculate the condition number of the function
f(x) = tan x at x = (π/2).

The condition number is

$$C_d = \frac{x_0\,(1/\cos^2 x_0)}{\tan x_0}$$

The condition numbers at the following values of x are:

(a) at $x_0 = \pi/2 + 0.1(\pi/2)$: $C_d = \dfrac{1.7279\,(40.86)}{-6.314} = -11.2$

(b) at $x_0 = \pi/2 + 0.01(\pi/2)$: $C_d = \dfrac{1.5865\,(4053)}{-63.66} = -101$

For case (b), the major source of ill-conditioning appears in the
derivative, which is due to the singularity of the function close
to (π/2). ■
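The two values in Example 10.2.1 can be reproduced directly from Eq. 10.4. The snippet below is a small sketch using only the Python standard library; the function name is illustrative.

```python
import math

def condition_number_tan(x0):
    """C_d = x0 * f'(x0) / f(x0) for f(x) = tan(x), with f'(x) = 1 / cos(x)**2."""
    return x0 * (1.0 / math.cos(x0) ** 2) / math.tan(x0)

for offset in (0.1, 0.01):
    x0 = math.pi / 2 + offset * (math.pi / 2)
    print(f"offset={offset}: C_d = {condition_number_tan(x0):.1f}")
```

Reducing the offset further makes the magnitude of the condition number grow roughly in inverse proportion to the distance from the singularity.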

Let us extend this concept of ill-conditioning to sets of
equations (Lipschutz 1966). The simplest case is a system of two
linear equations in two unknowns, say x and y:

$$
\begin{aligned}
a_1 x + b_1 y &= c_1 \\
a_2 x + b_2 y &= c_2
\end{aligned}
\tag{10.5}
$$

Three cases can arise which are best described geometrically.

(a) The system has exactly one solution: the point where both
lines intersect (see Fig. 10.1a)

(b) The system has no solution: the lines are parallel as shown in
Fig. 10.1b. This will arise when the slopes of the two lines are
equal but the intercepts are not, i.e., when
$\dfrac{a_1}{a_2} = \dfrac{b_1}{b_2} \neq \dfrac{c_1}{c_2}$

(c) The system has an infinite number of solutions since the lines
coincide as shown in Fig. 10.1c. This will arise when both the
slopes and the intercepts are equal, i.e., when
$\dfrac{a_1}{a_2} = \dfrac{b_1}{b_2} = \dfrac{c_1}{c_2}$

Fig. 10.1 Geometric representation of a system of two linear
equations. a The system has exactly one solution. b The system has
no solution. c The system has an infinite number of solutions
Consider a set of linear equations represented by

$$\mathbf{A}\mathbf{x} = \mathbf{b} \quad \text{whose solution is} \quad \mathbf{x} = \mathbf{A}^{-1}\mathbf{b} \tag{10.6}$$

Whether this set of linear equations can be solved or not is
easily verified by computing the rank of the matrix A, which is
the integer representing the order of the highest non-vanishing
determinant. Consider the following matrix:

$$
\begin{bmatrix}
1 & 2 & -2 & 3 \\
2 & -1 & 3 & -2 \\
-1 & 3 & 1 & -4 \\
3 & 6 & -6 & 9
\end{bmatrix}
$$

Since the first and last rows are identical, or more correctly
"linearly dependent", the rank is equal to 3. Computer programs
would identify such deficiencies correctly and return an error
message such as "matrix is not positive definite", indicating that
the estimated data matrix is singular. Hence, one cannot solve for
all four unknowns but only for three. However, such a test breaks
down when dealing with real data which includes measurement as
well as computation errors (during the computation of the inverse
of the matrix). This is illustrated by using the same matrix but
with the last row corrupted by a small noise term, of the order of
5% only.

$$
\begin{bmatrix}
1 & 2 & -2 & 3 \\
2 & -1 & 3 & -2 \\
-1 & 3 & 1 & -4 \\
\frac{63}{20} & \frac{117}{20} & \frac{-123}{20} & \frac{171}{20}
\end{bmatrix}
$$

Most computer programs, if faced with this problem, would
determine the rank of this matrix as 4! Hence, even small noise in
the data can lead to misleading conclusions. The notion of
condition number, introduced earlier, can be extended to include
such situations. Recall that any set of linear equations of order
n has n roots, either distinct or repeated, which are usually
referred to as characteristic roots or eigenvalues. The stability
or the robustness of the solution set, i.e., its closeness to
singularity, can be characterized by the same concept of condition
number ($C_d$) of the matrix A computed as:

$$C_d = \left(\frac{\text{largest eigenvalue}}{\text{smallest eigenvalue}}\right)^{1/2} \tag{10.7}$$

The value of the condition number for the above matrix is
$C_d$ = 371.7. Hence, a small perturbation in b induces a relative
perturbation 370 times greater in the solution of the system of
equations. Thus, even a singular matrix will be signaled as an
ill-conditioned (or badly conditioned) matrix due to roundoff and
measurement errors, and the analyst has to select thresholds to
infer whether the matrix is singular or merely ill-conditioned.


10.2.2 Structural Identifiability

Structural identifiability is defined as the problem of investi- 
gating the conditions under which system parameters can be 
uniquely estimated from experimental data, no matter how 
noise-free the measurements. This condition can be detected 
before the experiment is conducted by analyzing the basic 
modeling equations. Two commonly used testing techniques 
are the sensitivity coefficient approach, and one involving 
looking at the behavior of the poles of the Laplace Trans- 
forms of the basic modeling equations (Sinha and Kuszta 
1983). Only the sensitivity coefficient approach to detect 
identifiability is introduced below. 

Let us first look at some simple, almost trivial, examples 
of structural identifiability of models, where y is the system 
response and x is the input variable (or the input variable 
matrix). 

(a) A model such as $y = (a + cb)x$, where c is a system
constant, will not permit unique identification of a and
b when measurements of y and x are made. At best, the
overall term $(a + cb)$ can be identified.

(b) A model given by $y = (ab)x$ will not permit explicit
identification of a and b, merely the product $(a \cdot b)$.

(c) The model $y = b_1/(b_2 + b_3 t)$, where t is the regressor
variable, will not allow all three parameters to be identified,
only the ratios $(b_2/b_1)$ and $(b_3/b_1)$. This case is treated
in Example 10.2.3 below.

(d) The lumped parameter differential equation model for
the temperature drop with time T(t) of a cooling sphere
is given by $M c_p \frac{dT}{dt} = -hA(T - T_\infty)$, where M is the
mass of the sphere, $c_p$ its specific heat, h the heat loss
coefficient from the sphere to the ambient, A the surface
area of the sphere and $T_\infty$ the ambient temperature,
assumed constant (see Sect. 1.2.4). If measurements over
time of T are made, one can only identify the group
$(M c_p/hA)$. Note that this quantity is the
time constant of the system.
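
The identifiability failure in case (b) can be checked numerically. The sketch below is only an illustration: the helper `sensitivity_matrix`, the finite-difference step and the parameter values a = 2, b = 3 are assumptions introduced here, not part of the text. It builds the matrix of sensitivity coefficients for $y = (ab)x$ and shows that its two columns are proportional, so the matrix has rank 1.

```python
import numpy as np

def sensitivity_matrix(model, params, x, rel_step=1e-6):
    """Central-difference sensitivity coefficients dy/db_j, one column per parameter."""
    params = np.asarray(params, dtype=float)
    cols = []
    for j in range(params.size):
        h = rel_step * max(abs(params[j]), 1.0)
        up, dn = params.copy(), params.copy()
        up[j] += h
        dn[j] -= h
        cols.append((model(up, x) - model(dn, x)) / (2.0 * h))
    return np.column_stack(cols)

def model_b(p, x):
    """Model (b): y = (a*b)*x with p = [a, b]."""
    return p[0] * p[1] * x

x = np.linspace(1.0, 10.0, 20)                 # illustrative input values
S = sensitivity_matrix(model_b, [2.0, 3.0], x) # hypothetical a = 2, b = 3

# Two columns, but only one independent direction: a and b cannot be separated
rank = np.linalg.matrix_rank(S, tol=1e-6)

# Any pair (a, b) with the same product reproduces the same response
same_response = np.allclose(model_b([2.0, 3.0], x), model_b([1.0, 6.0], x))
```

A rank smaller than the number of parameters is the numerical counterpart of the statement that only the product $(a \cdot b)$ can be identified.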

The geometric interpretation of ill-conditioning (see discussion
above relevant to Eq. 10.5) can be related to structural
identifiability. As stated by Godfrey (1983), one can distinguish
between three possible outcomes of an estimability
analysis:

(i) The model parameters can be estimated uniquely, and
the model is globally identifiable,

(ii) A finite number of alternative estimates of model parameters
is possible, and the model is locally identifiable,
and

(iii) An infinite number of model parameter estimates are
possible, and the model is unidentifiable from the data
(this is the over-parameterized case).

Fig. 10.2 Plot of Example 10.2.2 used to illustrate local versus global
identifiability



Example 10.2.2:¹ Let us consider a more involved problem
to illustrate the concept of a locally identifiable problem.
Consider the first-order differential equation:

$$x y' = 2y \quad \text{whose solution is} \quad y(x) = C x^2 \qquad (10.8)$$

where C is a constant to be determined from the initial value.
If y(-1) = 1, then C = 1. Thus, one has a unique solution
$y(x) = x^2$ on some open interval about x = -1 which passes
through the origin (see Fig. 10.2). But to the right of the
origin, one may choose any value for C in Eq. 10.8. Three
different solutions are shown in Fig. 10.2. Hence, though one
has uniqueness of the solution near some point or region, the
solution may branch elsewhere, and the uniqueness may be
lost. ■

¹ From Edwards and Penney (1996) by permission of Pearson Education.

The problem of identifiability is almost a non-issue for
simple models; for example, models such as (a) or (b) above,
and even for solved Example 10.2.2. However, more
complex models demand a formal method rather than depend
on ad hoc manipulation and inspection. The sensitivity
coefficient approach allows formal testing to be performed.
Consider a model y(t, b) where t is an independent variable
and b is the parameter vector. The first derivative of y with
respect to $b_j$ is the sensitivity coefficient for $b_j$, and is
designated by $(\partial y/\partial b_j)$. Sensitivity coefficients indicate the
magnitude of change in the response y due to perturbations in the
values of the parameters. Let i index the observation
sets representing the range under which the experiment was
performed. The condition for structural identifiability is that
the sensitivity coefficients over the range of the observations
should not be linearly dependent. Linear dependence is said
to occur when, for p parameters in the model, the following
relation is true for all i observations even though not all the
$x_j$ values are zero (a formal proof is given by Beck and Arnold 1977):

$$x_1 \frac{\partial y_i}{\partial b_1} + x_2 \frac{\partial y_i}{\partial b_2} + \cdots + x_p \frac{\partial y_i}{\partial b_p} = 0 \qquad (10.9)$$

Example 10.2.3: Let us apply the condition given by
Eq. 10.9 to the model (c) above, namely $y = b_1/(b_2 + b_3 t)$.
Though mere inspection indicates that all three parameters
$b_1$, $b_2$ and $b_3$ cannot be individually identified, the question
is "can the ratios $(b_2/b_1)$ and $(b_3/b_1)$ be determined under all
conditions?" In this case, the sensitivity coefficients are:

$$\frac{\partial y_i}{\partial b_1} = \frac{1}{b_2 + b_3 t_i}, \quad \frac{\partial y_i}{\partial b_2} = \frac{-b_1}{(b_2 + b_3 t_i)^2} \quad \text{and} \quad \frac{\partial y_i}{\partial b_3} = \frac{-b_1 t_i}{(b_2 + b_3 t_i)^2} \qquad (10.10)$$

It is not clear whether there is linear dependence or not.
One can verify this by assuming $x_1 = b_1$, $x_2 = b_2$ and $x_3 = b_3$.
Then, the condition of Eq. 10.9 can be expressed as

$$b_1 \frac{\partial y_i}{\partial b_1} + b_2 \frac{\partial y_i}{\partial b_2} + b_3 \frac{\partial y_i}{\partial b_3} = 0 \qquad (10.11)$$

or

$$z = z_1 + z_2 + z_3 = 0 \qquad (10.12)$$

where

$$z_1 = \frac{\partial y_i}{\partial b_1}, \quad z_2 = \frac{b_2}{b_1}\frac{\partial y_i}{\partial b_2}, \quad z_3 = \frac{b_3}{b_1}\frac{\partial y_i}{\partial b_3}$$

The above function can occur in various cases with linear
dependence. Arbitrarily assuming $b_2 = 1$, the variation
of the sensitivity coefficients, or more accurately those of
$z$, $z_1$, $z_2$ and $z_3$, are plotted against $(b_3 t)$ in Fig. 10.3. One
notes that z = 0 throughout the entire range, denoting therefore
that identification of all three parameters is impossible.
Further, inspection of Fig. 10.3 reveals that both $z_1$
and $z_3$ seem to have become constant, therefore becoming
linearly dependent for $b_3 t > 3$. This means that not only is it
impossible to estimate all three parameters simultaneously
from measurements of y and t, but that it is also impossible
to estimate both $b_1$ and $b_3$ using data over the range
$b_3 t > 3$. ■

Fig. 10.3 Verifying linear dependence of model parameters of
Eq. 10.12. (From Beck and Arnold 1977 by permission of Beck)
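
The linear dependence tested in Example 10.2.3 can also be confirmed numerically. The following is a minimal sketch under stated assumptions: the parameter values, the range of t and the function name `sensitivities` are chosen here for illustration only. It evaluates the analytical sensitivity coefficients of $y = b_1/(b_2 + b_3 t)$, forms the weighted sum of Eq. 10.12, and checks the rank of the sensitivity matrix.

```python
import numpy as np

def sensitivities(b1, b2, b3, t):
    """Analytical sensitivity coefficients of y = b1 / (b2 + b3*t), as in Eq. 10.10."""
    d = b2 + b3 * t
    dy_db1 = 1.0 / d
    dy_db2 = -b1 / d**2
    dy_db3 = -b1 * t / d**2
    return dy_db1, dy_db2, dy_db3

b1, b2, b3 = 2.0, 1.0, 0.5            # illustrative values only
t = np.linspace(0.0, 20.0, 101)       # illustrative observation range
s1, s2, s3 = sensitivities(b1, b2, b3, t)

# Eq. 10.12: z = z1 + z2 + z3
z1 = s1
z2 = (b2 / b1) * s2
z3 = (b3 / b1) * s3
z = z1 + z2 + z3

# Three columns of sensitivity coefficients, but only two are independent
S = np.column_stack([s1, s2, s3])
rank = np.linalg.matrix_rank(S, tol=1e-10)
```

Because z vanishes at every observation, the three columns of `S` satisfy Eq. 10.11 exactly and the rank is 2 rather than 3, which is the numerical statement that only two parameter groups can be estimated.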

10.2.3 Numerical Identifiability 

A third crucial element is the numerical scheme used for
analyzing measured data. Even when the disturbing noise or
experimental error in the system is low, OLS may not be adequate
to identify the parameters without bias or with minimum
variance because of multi-collinearity effects between regressor
variables. As the noise becomes more significant, more bias is
introduced in the parameter estimates and so elaborate numerical
schemes such as iterative methods or multi-step methods
have to be used (discussed later in this chapter). Recall from
Sect. 5.4.3 that the parameter estimator vector is given by
$b = (X'X)^{-1}X'Y$ while the variance-covariance matrix is given by
$\mathrm{var}(b) = \sigma^2 (X'X)^{-1}$, where $\sigma^2$ is the mean square error of the
model error terms. Numerical identifiability, also called redundancy,
is defined as the inability to obtain proper parameter
estimates from the data even if the experiment is structurally
identifiable. This can arise when the data matrix (X'X) is
close to singular. Such a condition, associated with inadequate
richness in the data rather than with model mis-specification,
is referred to as ill-conditioned data (or, more loosely, as weak
data). If OLS estimation was used with ill-conditioned data,
the parameters, though unbiased, are not efficient (unreliable
with large standard errors) in the sense they no longer have
minimum variance. More importantly, OLS formulae would
understate both the standard errors and the model's prediction
uncertainty bands (even though the overall fit may be satisfactory).
The only recourse is either to take additional pertinent
measurements or to simplify or aggregate the model structure
so as to remove some of the collinear variables from the model.
How to deal with such situations is discussed in Sect. 10.3.

Data is said to be ill-conditioned when the regressor variables
of a model are correlated with each other, leading
to the correlation coefficient matrix R = X'X (see Eq. 5.32)
becoming close to singular, and resulting in its inverse
becoming undetermined or very large. This may have serious
effects on the estimates of the model coefficients (unstable
with large variance), and on the general applicability of the
estimated model. There are three commonly used diagnostic
measures, all of which depend on the variance-covariance matrix
C, that can be used to evaluate the magnitude of ill-conditioning
(Belsley et al. 1980).



(i) The correlation matrix R, which is akin to matrix C (given
by Eq. 5.31) but with the variables centered and scaled
(subtracting the mean and dividing by the standard deviation)
so that the diagonal elements are unity. This allows one to
investigate correlation between pairs of regressors in a qualitative
manner, but may be of limited use in assessing the magnitude
of overall multicollinearity of the regressor set.

(ii) Variance inflation factors, which provide a better quantitative
measure of the overall collinearity. The diagonal
elements of the C matrix are:

$$C_{jj} = \frac{1}{1 - R_j^2}, \quad j = 1, 2, \ldots, k \qquad (10.13)$$

where $R_j^2$ is the coefficient of multiple determination
resulting from regressing $x_j$ on the other (k - 1) variables.
Clearly, the stronger the linear dependency of $x_j$ on
the remaining regressors, the larger the value of $R_j^2$. The
variance of $b_j$ is said to be "inflated" by the quantity
$1/(1 - R_j^2)$. Thus, the variance inflation factors VIF($b_j$) =
$C_{jj}$. The VIF allows one to look at the joint relationship
among a specified regressor and all other regressors. Its
weakness, like that of the coefficient of determination
$R^2$, is its inability to distinguish among several coexisting
near-dependencies and the inability to assign
meaningful thresholds between high and low VIF values
(Belsley et al. 1980). Many texts suggest rules of
thumb: VIF > 10 indicates strong ill-conditioning, while
a VIF between 5 and 10 indicates a moderate problem.

(iii) Condition number, discussed above, is widely used by
analysts to determine robustness of the parameter estimates,
i.e., how low are their standard errors, since it
provides a measure of the joint relationship among regressors.
Evidence of collinearity is suggested for condition
numbers > 15, and corrective action is warranted
when the value exceeds 30 or so (Chatterjee and Price
1991).
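
The three diagnostics can be computed in a few lines. The sketch below is an illustration under stated assumptions rather than the procedure of any particular package: the data are synthetic (generated here with a fixed seed), the function name `collinearity_diagnostics` is hypothetical, the VIFs are obtained as the diagonal of the inverse of the regressor correlation matrix, and the condition number of Eq. 10.7 is evaluated on that same correlation matrix.

```python
import numpy as np

def collinearity_diagnostics(X):
    """Return (R, vif, cond) for a regressor matrix X with one column per regressor."""
    R = np.corrcoef(X, rowvar=False)      # (i) correlation matrix of the regressors
    vif = np.diag(np.linalg.inv(R))       # (ii) VIF_j = 1 / (1 - R_j^2), Eq. 10.13
    eig = np.linalg.eigvalsh(R)           # eigenvalues in ascending order
    cond = np.sqrt(eig[-1] / eig[0])      # (iii) condition number, Eq. 10.7
    return R, vif, cond

rng = np.random.default_rng(0)            # synthetic data, not from the text
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)       # nearly a copy of x1
x3 = rng.normal(size=n)                   # unrelated regressor
X = np.column_stack([x1, x2, x3])

R, vif, cond = collinearity_diagnostics(X)
```

With these synthetic regressors the VIFs of the first two columns are far above the rule-of-thumb value of 10 while that of the third stays near 1, which illustrates how the VIF singles out the regressors involved in a near-dependency whereas the condition number only flags that one exists.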
To summarize, ill-conditioning of a matrix X is said to
occur when one or more columns can be expressed as linear
combinations of the other columns, i.e., det (X'X) = 0 or close
to 0. Possible causes are either the data set is inadequate or
the model is overspecified, i.e., too many parameters have
been included in the model. Possible remedies one should
investigate are (i) collecting more data, or (ii) dropping variables
from the model based on physical insights. If these
fail, one can use biased estimation methods such as ridge
regression, or other transformations such as principal component
analysis (PCA) which are presented below.



10.3 Dealing with Collinear Regressors During Multivariate Regression

10.3.1 Problematic Issues

Perhaps the major problem with multivariate regression is 
that the "independent" variables are not really independent 
but collinear to some extent (and hence, the suggestion that 
the term "regressor" be used instead). Strong collinearity 
has the result that the variables are "essentially" influencing 
or explaining the same system behavior. For linear models, 
the Pearson correlation coefficient (presented in Sect. 3.4.2) 
provides the necessary indication of the strength of this over- 
lap. This issue of collinearity between regressors is a very 
common phenomenon which has important implications du- 
ring model building and parameter estimation. Not only can 
regression coefficients be strongly biased, but they can even 
have the wrong sign. Note that this could also happen if the 
range of variation of the regressor variables is too small, or if 
some important regressor variable has been left out. 

Example 10.3.1: Consider the simple example of a linear 
model with two regressors both of which are positively cor- 
related with the response variable y. The data consists of six 
samples as shown in Table 10.1. The pairwise plots shown in 
Fig. 10.4 clearly depict the fairly strong relationship between 
the two regressors. 

From the correlation matrix C for this data (Table 10.2),
the correlation coefficient between the two regressors is
0.776, which can be considered to be of moderate strength.
An OLS regression results in the following model:

$$y = 1.30 + 0.75 x_1 - 0.05 x_2 \qquad (10.14)$$

The model identified suggests a negative correlation
between y and $x_2$ which is contrary to both the correlation
coefficient matrix and the graphical trend in Fig. 10.4. This
irrationality is the result of the high inter-correlation between
the regressor variables. What has occurred is that the
inverse of the variance-covariance matrix (X'X) of the estimated
regression coefficients has become ill-conditioned
and unstable. A simple layperson explanation is to say that
$x_1$ has usurped more than its appropriate share of explicative
power of y at the detriment of $x_2$ which, then, had to correct
itself to such a degree that it ended up assuming a negative
correlation. ■

Table 10.1 Data table for Example 10.3.1

Table 10.2 Correlation matrix for Example 10.3.1

        x1      x2      y
x1      1.000   0.776   0.742
x2              1.000   0.553
y                       1.000

Mullet (1976), discussing why regression coefficients in
the physical sciences often have wrong signs, quotes: (i) Marquardt,
who postulated that multicollinearity is likely to be a
problem only when correlation coefficients among regressor
variables are higher than 0.95, and (ii) Snee, who used 0.9 as the
cut-off point. On the other hand, Draper and Smith (1981) state
that multicollinearity is likely to be a problem if the simple
correlation between two variables is larger than the correlation
of one or either variable with the dependent variable.
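
As a small worked illustration, the three rules of thumb can be applied to the correlations reported for Example 10.3.1 (0.776 between the regressors, 0.742 and 0.553 with the response). The function name `collinearity_flags` is hypothetical, and the Draper and Smith criterion is read here as "larger than at least one of the two correlations with the response", which is an interpretation of the wording above.

```python
def collinearity_flags(r_12, r_1y, r_2y):
    """Apply the three quoted rules of thumb to a two-regressor model."""
    r = abs(r_12)
    return {
        "Marquardt (> 0.95)": r > 0.95,
        "Snee (> 0.9)": r > 0.9,
        # Assumed reading: exceeds the weaker of the two correlations with y
        "Draper and Smith": r > min(abs(r_1y), abs(r_2y)),
    }

# Correlations taken from Table 10.2 of Example 10.3.1
flags = collinearity_flags(r_12=0.776, r_1y=0.742, r_2y=0.553)
```

Only the Draper and Smith criterion flags this data set, which is consistent with the wrong sign found in Eq. 10.14 even though the inter-regressor correlation is well below 0.9.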

Significant collinearity between regressor variables is likely
to lead to two different problems:

(i) though the model may provide a good fit to the current
data, its usefulness as a reliable predictive model is suspect.
The regression coefficients and the model predictions
tend to have large standard errors and uncertainty
bands which makes the model unstable. It is imperative
that a sample cross-validation evaluation be performed
to identify a suitable model (a case study is presented in
Sect. 10.3.4);

(ii) the regression coefficients in the model are no longer
proper indicators of the relative physical importance of
the regressor parameters.

Fig. 10.4 Data for Example 10.3.1 to illustrate how
multicollinearity in the regressors could result in model
coefficients with wrong signs



10.3.2 Principal Component Analysis and Regression

Principal Component Analysis (PCA) is one of the best
known multivariate methods for removing the adverse effects
of collinearity, while summarizing the main aspects of
the variation in the regressor set (see for example, Draper and
Smith 1981 or Chatterjee and Price 1991). It has a simple
intuitive appeal, and though very useful in certain disciplines
(such as the social sciences), its use has been rather limited
in engineering applications. It is not a statistical method leading
to a decision on a hypothesis, but a general method of
identifying which parameters are collinear and reducing the
dimension of multivariate data. This reduction in dimensionality
is sometimes useful for gaining insights into the behavior
of the data set. It also allows for more robust model
building, an aspect which is discussed below.

The premise in PCA is that the variance in the collinear
multi-dimensional data comprising the regressor variable
vector X can be reframed in terms of a set of orthogonal (or
uncorrelated) transformed variables forming the vector U. This
vector will then provide a means of retaining only a subset of
variables which explain most of the variability in the data. Thus,
the dimension of the data will be reduced without losing much
of the information (reflected by the variability in the data)
contained in the original data set, thereby allowing a more
robust model to be subsequently identified. A simple geometric
explanation of the procedure allows better conceptual
understanding of the method. Consider the two-dimensional
data shown in Fig. 10.5. One notices that much of the variability
in the data occurs along one dimension or direction. If
one were to rotate the orthogonal axes such that the major $u_1$
axis were to lie in the direction of greatest data variability
(see Fig. 10.5b), most of this variability will become uni-directional
with little variability being left for the orthogonal
$u_2$ axis to account for. The variability in the two-dimensional
original data set is, thus, largely accounted for by only one
variable, i.e., the transformed variable $u_1$.

Fig. 10.5 Geometric interpretation of what a PCA analysis does in
terms of variable transform for the case of two variables: (a) original
variables $x_1$ and $x_2$, (b) rotated variables $u_1$ and $u_2$. The rotation
has resulted in the primary axis explaining a major part of the variability
in the original data with the rest explained by the second rotated axis.
Reduction in dimensionality can be achieved by accepting a little loss
in the information contained in the original data set and dropping the
second rotated variable altogether

The real power of this method is when one has a large number
of dimensions; in such cases one needs to have some mathematical
means of ascertaining the degree of variation in the
multi-variate data along different dimensions. This is achieved
by looking at the eigenvalues. The eigenvalue can be viewed
as one which is indicative of the length of the axis while the
eigenvector specifies the direction of rotation.

Usually PCA analysis is done with standardized variables
Z instead of the original variables X such that variables Z have
zero mean and unit variance. Recall that the eigenvalues $\lambda$
(also called characteristic roots or latent roots) and the
eigenvectors A of the matrix Z'Z are defined by:

$$(Z'Z)\,A = \lambda A \qquad (10.15)$$

The eigenvalues are the solutions of the characteristic equation
of the covariance matrix of Z:

$$|Z'Z - \lambda I| = 0 \qquad (10.16)$$

Because the original data or regressor set X is standardized,
an important property of the eigenvalues is that their sum is
equal to the trace of the correlation matrix C, i.e.,

$$\lambda_1 + \lambda_2 + \cdots + \lambda_p = p \qquad (10.17)$$

where p is the dimension or number of variables. This follows
from the fact that each diagonal element of a correlation matrix
is unity. Usually, the eigenvalues are ranked
such that the first has the largest numerical value, the second
the second largest, and so on. The corresponding eigenvector
represents the coefficients of the principal components (PCs).
Thus, the linearized transformation for the PC from the original
vector of standardized variables Z can be represented by:

PC1: $u_1 = a_{11} z_1 + a_{12} z_2 + \cdots + a_{1p} z_p$ subject to $a_{11}^2 + a_{12}^2 + \cdots + a_{1p}^2 = 1$
PC2: $u_2 = a_{21} z_1 + a_{22} z_2 + \cdots + a_{2p} z_p$ subject to $a_{21}^2 + a_{22}^2 + \cdots + a_{2p}^2 = 1$
... to PCp                                                    (10.18)

where $a_{ij}$ are called the component weights and are the scaled
elements of the corresponding eigenvector.

Thus, the correlation matrix (given by Eq. 5.31) for the
standardized and rotated variables is now transformed into:

$$\begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_p \end{pmatrix} \quad \text{where } \lambda_1 > \lambda_2 > \cdots > \lambda_p \qquad (10.19)$$

Note that the off-diagonal terms are zero because the variable
vector U is orthogonal. Further, note that the eigenvalues
represent the variability of the data along the principal
components.
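
The properties stated in Eqs. 10.17 to 10.19 can be verified on a small data set. The sketch below is a minimal illustration under stated assumptions: the data are synthetic (three regressors, two of them collinear, generated with a fixed seed), and the correlation matrix is formed as Z'Z/(n - 1), i.e., with a scaling factor that Eq. 10.16 leaves implicit.

```python
import numpy as np

rng = np.random.default_rng(1)                 # synthetic data, not from the text
n, p = 100, 3
base = rng.normal(size=n)
X = np.column_stack([
    base + 0.3 * rng.normal(size=n),           # two regressors sharing a common signal
    base + 0.3 * rng.normal(size=n),
    rng.normal(size=n),                        # one unrelated regressor
])

# Standardize to zero mean and unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Correlation matrix and its eigen-decomposition
C = (Z.T @ Z) / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)           # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]              # rank the largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

U = Z @ eigvecs                                # rotated variables (PC scores), Eq. 10.18
C_u = (U.T @ U) / (n - 1)                      # their covariance, cf. Eq. 10.19

weight_norms = (eigvecs**2).sum(axis=0)        # each column of weights has unit length
```

Here `eigvals.sum()` equals p, `C_u` is diagonal with the eigenvalues on its diagonal, and every column of `eigvecs` satisfies the unit-length constraint of Eq. 10.18.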

If one keeps all the PCs, nothing is really gained in terms
of reduction in dimensionality, even though they are orthogonal
(i.e., uncorrelated), and the model building by regression
will be more robust. Model reduction is done by rejecting
those transformed variables U which contribute little or no
variance to the data. Since the eigenvalues are ranked, PC1
explains the most variability in the original data while each
succeeding eigenvalue accounts for increasingly less. A typical
rule of thumb to determine the cut-off is to drop any
factor which explains less than (1/p) of the variability, where
p is the number of parameters or the original dimension of
the regressor data set.

PCA has been presented as an approach which allows the 
dimensionality of the multivariate data to be reduced whi- 
le yielding uncorrelated regressors. Unfortunately, in most 
cases, the physical interpretation of the X variables, which 
often represent a physical quantity, is lost as a result of the 
rotation. A few textbooks (Manly 2005) provide examples 
where the new rotated variables retain some measure of phy- 
sical interpretation, but these are the exception rather than 
the rule in the physical sciences. In any case, the reduced set 
of transformed variables can now be used to identify multi- 
variate models that are, often but not always, more robust. 

Example 10.3.2: Consider Table 10.3 where PC rotation
has already been performed. The original data set contained
nine variables which were first standardized, and a PCA resulted
in the variance values as shown in the table. One notices
that PC1 explains 41% of the variation, PC2 23%, and so
on till all nine PC explain as much variation as was present
in the original data. Had the nine original variables been
independent or orthogonal, each PC would have explained on
average (1/p) = (1/9) = 11% of the variance. The eigenvalues listed
in the table correspond to the number of variables which
would have explained an equivalent amount of variation in
the data that is attributed to the corresponding PC. For example,
the first eigenvalue is determined as: 41/(100/9) = 3.69,
i.e., PC1 has the explicative power of 3.69 of the original
variables, and so on.

Table 10.3 How to interpret PCA results and determine thresholds.
(From Kachigan 1991 by permission of Kachigan)

Extracted   % of total variance accounted for   Eigenvalues
factors     Incremental   Cumulative            Incremental   Cumulative
PC1         41%           41%                   3.69          3.69
PC2         23            64                    2.07          5.76
PC3         14            78                    1.26          7.02
PC4         7             85                    0.63          7.65
PC5         5             90                    0.45          8.10
PC6         4             94                    0.36          8.46
PC7         3             97                    0.27          8.73
PC8         2             99                    0.18          8.91
PC9         1             100                   0.09          9.00

The above manner of studying the relative influence of each
PC allows heuristic thresholds to be defined. The typical rule
of thumb as stated above would result in all PC whose eigenvalues
are less than 1.0 being omitted from the reduced multivariate
data set. This choice can be defended on the grounds
that an eigenvalue below 1.0 implies that the PC explains
less than would a single original untransformed variable, and so
retaining it would be illogical since it would be defeating the
basic purpose, i.e., trying to achieve a reduction in the
dimensionality of the data. However, this is to be taken as a heuristic
criterion and not as a hard and fast rule. A convenient visual
indication of how higher factors contribute increasingly less
to the variance in the multivariate data can be obtained from
a scree plot generated by most PCA analysis software. This is
simply a plot of the eigenvalues versus the PCs (i.e., the first
and fourth columns of Table 10.3), as illustrated in the
example below. ■
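
The arithmetic behind Table 10.3 and the eigenvalue threshold is easily reproduced. The short sketch below takes the incremental percentages from the table; the variable names are chosen here for illustration.

```python
pct = [41, 23, 14, 7, 5, 4, 3, 2, 1]           # incremental % of variance, Table 10.3
p = len(pct)                                   # nine original variables

# Each eigenvalue is the share of variance expressed in "original variable" units
eigenvalues = [q / (100 / p) for q in pct]     # e.g. 41 / (100/9) = 3.69
cumulative = [sum(eigenvalues[: i + 1]) for i in range(p)]

# Rule of thumb: keep only PCs whose eigenvalue is at least 1.0
retained = [f"PC{i + 1}" for i, ev in enumerate(eigenvalues) if ev >= 1.0]
```

Under this rule only PC1 to PC3 are retained, and they account for 78% of the total variance according to the cumulative column of the table.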

Example 10.3.3: Reduction in dimensionality using PCA
for actual chiller data

Consider data assembled in Table 10.4 which consists of a
data set of 15 possible variables or characteristic features
(CFs) under 27 different operating conditions of a centrifugal
chiller. With the intention of reducing the dimensionality
of the data set, a PCA is performed in order to determine an
optimum set of principal components. Pertinent steps will be
shown and the final selection using the eigenvalues and the
scree plot will be justified. Also the component weights table
which assembles the PC models will be discussed.

Table 10.4 Data table with 15 regressor variables for Example 10.3.3. (From Reddy 2007 from data supplied by James Braun)

CF1     CF2     CF3     CF4     CF5     CF6     CF7     CF8     CF9     CF10    CF11    CF12    CF13    CF14    CF15
3.765   5.529   5.254   3.244   15.078  4.911   2.319   5.473   83.069  39.781  0.707   0.692   1.090   5.332   0.706
3.405   3.489   3.339   3.344   19.233  3.778   1.822   4.550   73.843  32.534  0.603   0.585   0.720   4.977   0.684
2.425   1.809   1.832   3.500   31.333  2.611   1.009   3.870   73.652  21.867  0.422   0.397   0.392   3.835   0.632
4.512   6.240   5.952   2.844   12.378  5.800   3.376   5.131   71.025  45.335  0.750   0.735   1.260   6.435   0.701
3.947   3.530   3.338   3.322   18.756  3.567   1.914   4.598   71.096  32.443  0.568   0.550   0.779   5.846   0.675
2.434   1.511   1.558   3.633   35.533  1.967   0.873   3.821   72.116  18.966  0.335   0.311   0.362   3.984   0.611
4.748   5.087   4.733   3.156   14.478  4.589   2.752   5.060   70.186  39.616  0.665   0.649   1.107   6.883   0.690
4.513   3.462   3.197   3.444   19.511  3.356   1.892   4.716   69.695  31.321  0.498   0.481   0.844   6.691   0.674
3.503   2.153   2.053   3.789   28.522  2.244   1.272   4.389   68.169  22.781  0.347   0.326   0.569   5.409   0.647
3.593   5.033   4.844   2.122   13.900  4.878   2.706   3.796   72.395  48.016  0.722   0.706   0.990   5.211   0.689
3.252   3.466   3.367   2.122   17.944  3.700   1.720   3.111   76.558  42.664  0.626   0.607   0.707   4.787   0.679
2.463   1.956   2.004   2.233   27.678  2.578   1.102   2.540   73.381  32.383  0.472   0.446   0.415   3.910   0.630
4.274   6.108   5.818   2.056   11.944  5.422   3.323   4.072   71.002  52.262  0.751   0.739   1.235   6.098   0.701
3.678   3.330   3.228   2.089   17.622  3.389   1.907   3.066   70.252  41.724  0.593   0.573   0.722   5.554   0.662
2.517   1.644   1.714   2.256   30.967  2.133   1.039   2.417   69.184  29.607  0.383   0.361   0.385   4.126   0.610
4.684   5.823   5.522   2.122   12.089  4.989   3.140   4.038   71.271  50.870  0.732   0.721   1.226   6.673   0.702
4.641   4.002   3.714   2.456   16.144  3.589   2.188   3.829   70.354  40.714  0.591   0.574   0.928   6.728   0.690
3.038   1.828   1.796   2.689   29.767  1.989   1.061   3.001   70.279  26.984  0.347   0.327   0.470   4.895   0.621
3.763   5.126   4.924   1.400   12.744  4.656   2.687   2.541   73.612  62.921  0.733   0.721   1.030   5.426   0.694
3.342   3.344   3.318   1.567   16.933  3.456   1.926   2.324   70.932  50.601  0.631   0.611   0.698   5.073   0.659
2.526   1.940   2.053   1.378   25.944  2.600   1.108   1.519   74.649  46.588  0.476   0.453   0.421   4.122   0.613
4.411   6.244   5.938   1.522   11.689  5.411   3.383   3.193   70.782  61.595  0.749   0.740   1.282   6.252   0.705
4.029   3.717   3.559   1.178   14.933  3.844   2.128   1.917   69.488  61.187  0.628   0.609   0.817   6.035   0.667
2.815   1.886   1.964   1.378   26.333  2.122   0.946   1.394   79.851  48.676  0.420   0.398   0.448   4.618   0.609
4.785   5.528   5.203   1.611   11.756  5.100   3.052   2.948   69.998  59.904  0.717   0.704   1.190   6.888   0.694
4.443   3.882   3.679   1.933   15.578  3.556   2.121   3.038   70.939  46.667  0.612   0.597   0.886   6.553   0.678
3.151   2.010   2.054   1.656   25.367  2.333   1.224   1.859   69.686  41.716  0.409   0.390   0.500   5.121   0.615



A principal component analysis is performed with the
purpose of obtaining a small number of linear combinations
of the 15 variables which account for most of the variability
in the data. From the eigenvalue table (Table 10.5) as well
as the scree plot shown (Fig. 10.6), one notes that there are
three components with eigenvalues greater than or equal to
1.0, and that together they account for 95.9% of the variability
in the original data. Hence, it is safe to only retain three
components.

The equations of the principal components can be deduced
from the table of component weights shown (Table 10.6).
For example, the first principal component has the equation

PC1 = 0.268037 * CF1 + 0.215784 * CF10 + 0.294009 * CF11
      + 0.29512 * CF12 + 0.302855 * CF13 + 0.247658 * CF14
      + 0.29098 * CF15 + 0.302292 * CF2 + 0.301159 * CF3
      - 0.06738 * CF4 - 0.297709 * CF5 + 0.297996 * CF6
      + 0.301394 * CF7 + 0.123134 * CF8 - 0.0168 * CF9

where the values of the variables in the equation are standardized
by subtracting their means and dividing by their
standard deviations. ■

Table 10.5 Eigenvalue table for Example 10.3.3

Component   Eigenvalue      Percent of   Cumulative
number                      variance     percentage
1           10.6249         70.833       70.833
2           2.41721         16.115       86.947
3           1.34933         8.996        95.943
4           0.385238        2.568        98.511
5           0.150406        1.003        99.514
6           0.0314106       0.209        99.723
7           0.0228662       0.152        99.876
8           0.00970486      0.065        99.940
9           0.00580352      0.039        99.979
10          0.00195306      0.013        99.992
11          0.000963942     0.006        99.998
12          0.000139549     0.001        99.999
13          0.0000495725    0.000        100.000
14          0.0000261991    0.000        100.000
15          0.0000201062    0.000        100.000
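
The selection made in Example 10.3.3 can be re-derived from the eigenvalues alone. The sketch below copies the eigenvalues of Table 10.5 and recomputes the percentage columns; it is a consistency check on the tabulated numbers, not a re-run of the PCA on the chiller data.

```python
import numpy as np

# Eigenvalues from Table 10.5 (15 standardized chiller variables)
eig = np.array([10.6249, 2.41721, 1.34933, 0.385238, 0.150406, 0.0314106,
                0.0228662, 0.00970486, 0.00580352, 0.00195306, 0.000963942,
                0.000139549, 0.0000495725, 0.0000261991, 0.0000201062])

pct = 100.0 * eig / eig.sum()          # percent of variance per component
cum = np.cumsum(pct)                   # cumulative percentage
n_keep = int(np.sum(eig >= 1.0))       # components with eigenvalue >= 1.0
```

The eigenvalues sum to 15 (the number of standardized variables, as required by Eq. 10.17), three components pass the threshold, and `cum[2]` reproduces the 95.9% quoted in the example.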



Thus, in summary, PCA takes a group of "n" original re- 
gressor variables and re-expresses them as another set of "n" 
artificial variables, each of which represents a linear combination
of the original variables. This transformation retains
all the information found in the original regressor variables.
These indices, known as principal components (PCs), have
several useful properties: (i) they are uncorrelated with one
another, and (ii) they are ordered so that the first PC explains
the largest proportion of the variation of the original data, the
second PC explains the next largest proportion, and so on.
When the original variables are highly correlated, the variance
of many of the later PCs will be so small that they can be
ignored. Consequently, the number of regressor variables in
the model can be reduced with little loss in model goodness-of-fit.
The same reduction in dimensionality also removes
the collinearity between the regressors and would lead to
more stable parameter estimation and robust model identification.
The coefficients of the PCs retained in the model are
said to be more stable and, when the resulting model with
the PC variables is transformed back in terms of the original
regressor variables, the coefficients are said to offer more
realistic insight into how the individual physical variables
influence the response variable.

Fig. 10.6 Scree plot of Table 10.5 data

Table 10.6 Component weights for Example 10.3.3

        Component 1   Component 2   Component 3
CF1      0.268037      0.126125     -0.303013
CF10     0.215784     -0.447418     -0.0319413
CF11     0.294009     -0.0867631     0.155947
CF12     0.29512      -0.0850094     0.149828
CF13     0.302855      0.0576253    -0.0135352
CF14     0.247658      0.117187     -0.388742
CF15     0.29098       0.133689      0.08111
CF2      0.302292      0.0324667     0.0801237
CF3      0.301159      0.0140657     0.0975298
CF4     -0.06738       0.620147      0.100548
CF5     -0.297709      0.0859571     0.00911293
CF6      0.297996      0.0180763     0.130096
CF7      0.301394      0.0112407    -0.0443937
CF8      0.123134      0.576743      0.151909
CF9     -0.0168       -0.0890017     0.796502
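
The two properties of principal components described above (mutually uncorrelated, ordered by explained variance) can be checked numerically. The following is a minimal sketch, assuming PCA is done on the correlation matrix of standardized regressors; the function name `principal_components` and the synthetic data are illustrative assumptions, not part of the original example.

```python
import numpy as np

def principal_components(X):
    """Return (scores, weights, explained) for the columns of X."""
    # Standardize each column so that PCA works on the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder: largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                       # each PC is a linear combination of the columns
    explained = eigvals / eigvals.sum()        # fraction of total variance per PC
    return scores, eigvecs, explained

# Hypothetical data: x2 is nearly collinear with x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
scores, weights, explained = principal_components(np.column_stack([x1, x2, x3]))
print("fraction of variance explained by each PC:", explained)
```

With strongly correlated inputs such as these, the last component carries very little variance, which is the situation in which it can be dropped from the regression.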



If the principal components could be interpreted in physical
terms, PCA would be an even more valuable tool.
Unfortunately, this is often not the case. Though it has been
shown to be useful in the social sciences as a way of finding
effective combinations of variables, it has had limited success
in the physical and engineering sciences. Draper and
Smith (1981) caution that PCA may be of limited usefulness
in the physical and engineering sciences, in contrast to the
social sciences where models are generally weak and numerous
correlated regressors tend to be included in the model. Reddy and
Claridge (1994) conducted synthetic experiments to evaluate
the benefits of PCA against multiple linear regression (MLR)
for modeling energy use in buildings. They concluded that PCA
offers a possible benefit over MLR only when the data are poorly
explained by the MLR model and the correlation strengths
among regressors are high, with the caveat that injudicious
use of PCA may exacerbate rather than overcome problems
associated with multi-collinearity.



10.3.3 Ridge Regression 

Another remedy for ill-conditioned data is to use ridge regression
(see for example, Chatterjee and Price 1991; Draper
and Smith 1981). This method results in more stable estimates
than those of OLS in the sense that they are more robust,
i.e., less affected by slight variations in the estimation data.
There are several alternative ways of defining and computing
ridge estimates; the ridge trace is perhaps the most intuitive.
It is best understood in the context of a graphical representation
which unifies the problems of detection and estimation.
Since the matrix (X'X) is close to singular (its determinant is
close to zero), the approach involves introducing a known amount
of "noise" via a variable k, which makes the solution less
sensitive to multicollinearity. With this approach, the ridge
parameter vector is given by:

$$ b_{Ridge} = (X'X + kI)^{-1} X'Y \qquad (10.20) $$

where I is the identity matrix.

Parameter variance is then given by

$$ \mathrm{var}(b)_{Ridge} = \sigma^2 (X'X + kI)^{-1} X'X (X'X + kI)^{-1} \qquad (10.21a) $$

with prediction bands:

$$ \mathrm{var}(y_0)_{Ridge} = \sigma^2 \{ 1 + X_0' [ (X'X + kI)^{-1} X'X (X'X + kI)^{-1} ] X_0 \} \qquad (10.21b) $$

where σ² is the mean square error of the residuals.

It must be pointed out that ridge regression should be
performed with standardized variables (i.e., each individual
observation has the mean subtracted and is then divided by the
standard deviation) in order to remove large differences in the
numerical values of the different regressors.
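
The ridge estimate and its parameter covariance can be evaluated directly from the two matrix expressions (Eqs. 10.20 and 10.21a). The sketch below is illustrative only: it assumes the regressors are centered and scaled to unit length so that X'X is the correlation matrix (one common form of standardization), and it assumes n - p - 1 degrees of freedom for the residual mean square. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def to_correlation_form(X):
    """Center each column and scale it to unit length, so Z'Z is the correlation matrix."""
    Xc = X - X.mean(axis=0)
    return Xc / np.linalg.norm(Xc, axis=0)

def ridge_fit(Z, y, k):
    """Ridge estimate (Eq. 10.20) and parameter covariance (Eq. 10.21a)."""
    n, p = Z.shape
    A = np.linalg.inv(Z.T @ Z + k * np.eye(p))    # (X'X + kI)^-1
    b = A @ Z.T @ y                               # Eq. 10.20
    resid = y - Z @ b
    sigma2 = resid @ resid / (n - p - 1)          # residual mean square (assumed dof)
    cov_b = sigma2 * A @ (Z.T @ Z) @ A            # Eq. 10.21a
    return b, cov_b

# Hypothetical data: two nearly collinear regressors.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.05 * rng.normal(size=100)
y = 2.0 * x1 + 1.0 * x2 + 0.5 * rng.normal(size=100)

Z = to_correlation_form(np.column_stack([x1, x2]))
yc = y - y.mean()                                 # centered response, no intercept needed
b_ols, cov_ols = ridge_fit(Z, yc, k=0.0)          # k = 0 reproduces OLS
b_rr, cov_rr = ridge_fit(Z, yc, k=0.05)
print("OLS parameter variances:  ", np.diag(cov_ols))
print("ridge parameter variances:", np.diag(cov_rr))
```

Setting k = 0 recovers the OLS solution, so the same function can be used to compare the two estimators on equal terms.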

Fig. 10.7 The optimal value of the ridge factor k is the value for which
the mean square error (MSE) of model predictions is minimum. k=0
corresponds to OLS estimation. In this case, k=0.131 is optimal with
MSE=3.78x10''



One increases the value of the "jiggling factor" k from 0
(the OLS case) to 1.0 to determine the optimum value
of k, i.e., the one which yields the least model mean square
error (MSE). This is often based on the cross-validation or
testing data set (see Sect. 5.2.3d) and not on the training data
set from which the model was developed. Usually the value of k is
in the range 0-0.2. This is illustrated in Fig. 10.7. The ridge
estimators are biased but tend to be stable and (hopefully) have
smaller variance than OLS estimators despite the bias (see
Fig. 10.8). Forecasts of the response variable would tend to
be more accurate and the uncertainty bands more realistic.
Unfortunately, many practical problems exhibit all the classical
signs of multicollinear behavior but, often, applying PCA
or ridge analysis does not necessarily improve the prediction
accuracy of the model over the standard multi-linear OLS
regression (MLR). The case study below illustrates such an
instance.
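
The search for the ridge factor described above amounts to a one-dimensional scan of k against the testing-set MSE. A minimal sketch follows, assuming unit-length scaling based on training-set statistics and a synthetic data set; the helper names `ridge_coef` and `scan_ridge_factor` are hypothetical.

```python
import numpy as np

def ridge_coef(Z, y, k):
    """Ridge coefficients for regressors already in standardized form."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ y)

def scan_ridge_factor(Z_train, y_train, Z_test, y_test, k_values):
    """Return the k with the lowest testing-set MSE, plus the MSE for every k."""
    mse = np.array([
        np.mean((y_test - Z_test @ ridge_coef(Z_train, y_train, k)) ** 2)
        for k in k_values
    ])
    return k_values[int(np.argmin(mse))], mse

# Hypothetical collinear data, split time-wise into training and testing sets.
rng = np.random.default_rng(2)
x1 = rng.normal(size=150)
x2 = x1 + 0.05 * rng.normal(size=150)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 1.0 * x2 + 0.5 * rng.normal(size=150)

X_train, X_test, y_train, y_test = X[:100], X[100:], y[:100], y[100:]
mu = X_train.mean(axis=0)
scale = np.linalg.norm(X_train - mu, axis=0)      # training-set scaling reused for testing
Z_train, Z_test = (X_train - mu) / scale, (X_test - mu) / scale

k_values = np.linspace(0.0, 0.2, 21)              # k = 0 is the OLS case
k_best, mse = scan_ridge_factor(Z_train, y_train - y_train.mean(),
                                Z_test, y_test - y_train.mean(), k_values)
print("k with lowest testing MSE:", k_best)
```

Whether the minimum falls at k = 0 or at a positive value depends entirely on the data, which is the point made in the paragraph above.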



Fig. 10.8 Ridge regression (RR) estimates are biased compared to
OLS estimates but the variance of the parameters will (hopefully) be
smaller as shown (figure labels: OLS estimate, RR estimate, true value,
biased estimate, central estimate)



10.3.4 Chiller Case Study Analysis Involving
Collinear Regressors

This section describes a study (Reddy and Andersen 2002)
where the benefits of ridge regression compared to OLS are
evaluated in terms of model prediction accuracy in the framework
of steady state chiller modeling. Hourly field monitored
data (consisting of 810 observations) of a centrifugal
chiller included four variables: (i) Thermal cooling capacity
Qch in kW; (ii) Compressor electrical power P in kW; (iii)
Supply chilled water temperature Tchi in K, and (iv) Cooling
water inlet temperature Tcdi in K.

The data set has to be divided into two sub-sets: (i) a training
set, which is meant to compare how different model
formulations fit the data, and (ii) a testing (or validating) data
set, which is meant to single out the most suitable model in
terms of its predictive accuracy. The intent of having training
and testing data sets is to avoid over-fitting and to obtain
a more accurate indication of the model prediction errors,
which is why a model is being identified in the first place
(see Sect. 5.3.2d).

Given a data set, there are numerous ways of selecting
the training and testing data sets. The simplest is to split the
data time-wise into a training set (containing the first 550 data
points or about 2/3rd of the monitored data) and a testing set
(containing the remaining 260 data points or about 1/3rd of the
monitored data). It is advisable to select the training data set
such that the range of variation of the individual variables is
larger than that of the same variables in the testing data set,
and also that the same types of cross-correlations among variables
be present in both data sets. This avoids the issue of model
extrapolation errors interfering with the model building
process. However, if one wishes to specifically compare the
various models in terms of their extrapolation accuracy, i.e.,
their ability to predict beyond the range of variation in the
data used to identify the model, the training and
testing data sets can be selected in several ways. An extreme
form of data separation is to sort the data by Qch and select
the lower 2/3rd portion of the data for model development
and the upper 1/3rd for model evaluation. The results of such
an analysis are reported in Reddy and Andersen (2002) but
omitted here since the evaluation results were similar
(indicative of a robust model).

Pertinent descriptive statistics for both sets are given in
Table 10.7. Note that there is relatively little variation in
the two temperature variables, while the cooling load and
power experience important variations. Further, as stated
earlier, the ranges of variation of the variables in the testing
data set are generally within those of the training data set.
Another issue is to check the correlations and serial correlations
among the variables. These are shown in Table 10.8.
Note that the correlations between (P, Tchi) and (Qch, Tchi) are
somewhat different during the training and testing data sets
(the correlation seems to have increased from around 0.5 to
about 0.9); one has to contend with this difference. In terms
of collinearity, note that the correlation coefficient between
(P, Qch) is very important (0.98) while the others are about
0.6 or less, which is not negligible but not very significant
either. A scatter plot of the thermal load against COP is
shown in Fig. 10.9.

Table 10.7 Descriptive statistics for the chiller data

            Training data set (550 data points)    Testing data set (260 data points)
            P      Qch     Tchi    Tcdi             P      Qch     Tchi    Tcdi
            (kW)   (kW)    (K)     (K)              (kW)   (kW)    (K)     (K)
Mean        222    1,108   285     302              202    1,011   288     302
Std. Dev.   37.8   282     2.43    0.73             14.7   149     2.97    0.77
Min         154    517     282     299              163    630     283     297
Max         340    1,771   292     304              230    1,241   293     303

Table 10.8 Correlation matrices for the training and testing data sets
(upper-right triangle: training data set, 550 data points;
lower-left triangle: testing data set, 260 data points)

         P      Qch    Tchi   Tcdi
P        -      0.98   0.52   0.57
Qch      0.99   -      0.62   0.59
Tchi     0.86   0.91   -      0.37
Tcdi     0.54   0.61   0.54   -

Two different steady-state chiller thermal performance
models have been evaluated. These are the black-box model
(referred to as MLR) and the grey-box model (referred to as
the GN model). These models and their functional forms prior
to regression are described in Pr. 5.13 of Chap. 5. Note that
while the MLR model uses the basic measurement variables,
the GN model uses transformed variables (x1, x2, x3). It must
be pointed out that ridge regression should be performed
with standardized variables in order to remove large differences
in the numerical values of the different regressors.

Fig. 10.9 Scatter plot of chiller COP against thermal cooling load Qch
in kW. High values of Qch are often encountered during hot weather
conditions at which times condenser water temperatures tend to be higher,
which reduce chiller COP. This effect partly explains the leveling
off and greater scatter in the COP at higher values of thermal cooling
load

Table 10.9 Correlation coefficient matrix for regressors of the GN
model during training (training data set, 550 data points)

       Y      X1     X2     X3
Y      1.0    0.93   0.82   -0.79
X1            1.0    0.96   -0.95
X2                   1.0    -0.92
X3                          1.0

A look at the descriptive statistics for the regressors in
the physical model provides an immediate indication as to
whether the data set may be ill-conditioned or not. One notes
that the magnitudes of the three variables differ by several
orders of magnitude, but since ridge regression uses standardized
variables, this is not an issue. The estimated correlation matrix for
the regressors in the GN model is given in Table 10.9. It is
clear that there is evidence of strong multi-collinearity between
the regressors.

From the statistical analysis it is found that the GN physical
model fits the field-monitored data well except perhaps
at the very high end (see Fig. 10.10a). The adjusted
R² = 99.1%, and the coefficient of variation (CV) = 1.45%.
An analysis of variance also shows that there is no statistical
evidence to reduce the model order. The model residuals
have constant variance, as indicated by the studentized
residual plots versus time (row number of data) and by
regressor variable (Fig. 10.10b, c). Further, since most of them
are contained within bounds of ±2.0, one need not be unduly
concerned with influence points and outlier points though a
few can be detected.

The regressors are strongly correlated (Table 10.9). The
condition number of the matrix is close to 81, suggesting that
the data is ill-conditioned, since this value is larger than the
threshold value of 30 stated earlier. In an effort to overcome
this adversity, ridge regression is performed with the ridge
factor k varied from 0 (which is the OLS case) to k=0.2. The
ridge trace for individual parameters is shown in Fig. 10.10d
and the VIF values are shown in Table 10.10.
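
The two collinearity diagnostics used here can be computed from the regressor correlation matrix alone. The sketch below is an illustration under stated assumptions: the condition number is taken as the ratio of the largest to the smallest eigenvalue of the correlation matrix (some texts use the square root of this ratio), and the ridge VIFs are taken as the diagonal of (R + kI)^-1 R (R + kI)^-1, which reduces to the usual VIF at k = 0. The input is the rounded regressor block of Table 10.9, so the results are not expected to reproduce the tabulated values exactly.

```python
import numpy as np

def condition_number(R):
    """Ratio of largest to smallest eigenvalue of the correlation matrix (assumed definition)."""
    eig = np.linalg.eigvalsh(R)
    return eig.max() / eig.min()

def ridge_vif(R, k=0.0):
    """Variance inflation factors for ridge factor k; k = 0 gives the ordinary VIFs."""
    A = np.linalg.inv(R + k * np.eye(R.shape[0]))
    return np.diag(A @ R @ A)

# Correlations among X1, X2, X3 as listed (rounded) in Table 10.9.
R = np.array([[1.00, 0.96, -0.95],
              [0.96, 1.00, -0.92],
              [-0.95, -0.92, 1.00]])

cond = condition_number(R)
vif_ols = ridge_vif(R, 0.0)
vif_rr = ridge_vif(R, 0.04)
print("condition number:", cond)
print("VIF at k=0:   ", vif_ols)
print("VIF at k=0.04:", vif_rr)
```

The VIFs fall as k increases, which is the behavior reported for the training data.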

As stated earlier, OLS estimates will be unbiased, while
ridge estimation will be biased but the estimation is likely
to be more efficient. The internal and external predictive
accuracy of both estimation methods were evaluated using the
training and testing data sets respectively. It is clear from
Table 10.10 that, during model training, CV values increase
as k is increased, while the VIF values of the parameters
decrease. Hence, one cannot draw any inferences from the training
data alone about whether ridge regression is better than OLS
and, if so, which value of ridge parameter k is optimal. Adopting
the rule of thumb that VIF values should be about 5 or lower would
suggest k=0.02 or 0.04 to be reasonable choices. The CV
and normalized mean bias error (NMBE) values of the models
for the testing data set are also shown in Table 10.10.
For OLS (k=0), these values are 1.63% and -1.01%, indicating
that the identified OLS model can provide extremely good
predictions. Again, both these indices increase in magnitude as
the value of k is increased, indicating poorer predictive ability
both in variability or precision and in bias. Hence, in this case,
even though the data is ill-conditioned, the OLS identification is
the better estimation approach if the chiller model identified
is to be used for predictions only.

Fig. 10.10 Analysis of chiller data using the GN model
(gray-box model). a. x-y plot b. Residual plot versus time
sequence in which data was collected c. Residual plot versus
predicted response d. Variance inflation factors for the regressor
variables

Table 10.10 Results of ridge regression applied to the GN model.
Despite strong collinearity among the regressors, the OLS model has
lower CV and NMBE than those from ridge regression

           Training data set                            Testing data set
           Adj-R²   CV(%)   VIF(X1)  VIF(X2)  VIF(X3)   CV(%)   NMBE(%)
k=0.0 (a)  99.1     1.45    18.43    12.69    9.61      1.63    -1.01
k=0.02     89.8     3.02    7.67     6.39     5.71      2.86    1.91
k=0.04     85.1     4.16    4.22     3.99     3.87      4.35    3.18
k=0.06     82.3     4.87    2.69     2.78     2.82      5.20    3.86
k=0.08     80.2     5.32    1.88     2.07     2.17      5.73    4.27

(a) Equivalent to OLS estimation
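
The two indices used to judge predictive accuracy can be computed as follows. This is a sketch under assumptions: CV is taken as the root mean square error divided by the mean of the measured response, NMBE as the mean of (predicted minus measured) divided by the same mean, both in percent, with an optional degrees-of-freedom correction `n_params`. The sign convention and the correction are assumptions and should be checked against the definitions used earlier in the book; the numbers are hypothetical.

```python
import numpy as np

def cv_percent(y_meas, y_pred, n_params=0):
    """Coefficient of variation of the RMSE, in percent of the mean measured value."""
    n = len(y_meas)
    rmse = np.sqrt(np.sum((y_pred - y_meas) ** 2) / (n - n_params))
    return 100.0 * rmse / np.mean(y_meas)

def nmbe_percent(y_meas, y_pred, n_params=0):
    """Normalized mean bias error, in percent (positive when the model over-predicts)."""
    n = len(y_meas)
    return 100.0 * np.sum(y_pred - y_meas) / ((n - n_params) * np.mean(y_meas))

# Hypothetical check: a model that over-predicts every point by 2 units.
y_meas = np.array([100.0, 200.0, 300.0])
y_pred = y_meas + 2.0
cv = cv_percent(y_meas, y_pred)
nmbe = nmbe_percent(y_meas, y_pred)
print(cv, nmbe)
```

CV measures scatter and NMBE measures systematic offset, so a model can have a small NMBE and a large CV, or the reverse.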

The MLR model is a black-box model with linear second
order terms in the three regressor variables (Eq. 5.75). Some
or many of the terms may be statistically insignificant,
and so a step-wise OLS regression was performed. Both forward
selection and backward elimination techniques were
evaluated using an F-ratio of 4 as the cut-off. While
backward elimination retained seven terms (excluding the
constant), forward selection retained only three. The adjusted
R² and CV statistics were almost identical and so the
forward selection model is retained for parsimony.

The final MLR model contains the following three variables:
[Qch², Tcdi·Qch, Tchi·Qch]. The fit is again excellent
with adjusted R² = 99.2% and CV = 0.95% (very slightly better
than those of the GN model). An analysis of variance also
shows that there is no statistical evidence to reduce the model
order, while the residuals are well-behaved. Unfortunately,
the regressors are very strongly correlated (the correlation
coefficients for all three variables are 0.99), indicating
ill-conditioned data. The condition number of the matrix is very
large (76), again suggesting ill-conditioning.

The ridge regression results for the MLR model are shown
in Table 10.11. How the CV values, during model training,
increase as the ridge factor k is increased from 0 to 0.02 can
be noted along with the VIF of the regressors. The CV and
NMBE values of the models for the testing data set are also
assembled. Again, despite the strong ill-conditioning of the
OLS model, the OLS model (with k=0) turns out to be the
best predictive model, with external CV and NMBE values
of 1.13% and 0.62% respectively, which are excellent.

Table 10.11 Results of ridge regression applied to the MLR model.
Despite strong collinearity among the regressors, the OLS model has
lower CV and NMBE than those from ridge regression

           Training data set                            Testing data set
           Adj-R²   CV(%)   VIF(X1)  VIF(X2)  VIF(X3)   CV(%)   NMBE(%)
k=0.0 (a)  99.2     0.95    49.0     984.8    971.3     1.13    0.62
k=0.005    93.1     1.81    26.3     15.0     15.2      2.86    2.38
k=0.01     89.9     2.39    16.4     6.43     6.57      3.52    2.89
k=0.015    87.8     2.80    11.2     3.89     4.00      3.95    3.21
k=0.02     86.3     3.10    8.17     2.69     2.75      4.25    3.43

(a) Equivalent to OLS estimation


10.3.5 Stagewise Regression

Another approach that offers the promise of identifying
sound parameters of a linear model whose regressors are
correlated is stagewise regression (Draper and Smith 1981).
It was extensively used prior to the advent of computers. The
approach, though limited in use nowadays, is still a useful
technique to know, and can provide some insights into model
structure. Consider the following multivariate linear model
with collinear regressors:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n \qquad (10.22) $$

The basic idea is to perform a simple regression with one
regressor at a time, with the order in which they are selected
depending on their correlation strength with the response
variable. This strength is re-evaluated at each step. The
algorithm describing its overall methodology consists of the
following steps:

(a) Compute the correlation coefficients of the response variable
    y against each of the regressors, and identify the
    strongest one, say xi.
(b) Perform a simple OLS regression of y vs xi, and compute
    the model residuals u. This becomes the new response
    variable.
(c) From the remaining regressor variables, identify the
    one most strongly correlated with the new response variable.
    If this is represented by xj, then regress u vs xj
    and recompute the second stage model residuals, which
    become the new response variable.
(d) Repeat this process for as many remaining regressor variables
    as are significant.
(e) The final model is found by rearranging the terms of
    the final expression into the standard regression model
    form, i.e., with y on the left-hand side and the significant
    regressors on the right-hand side.

The basic difference between this method and the forward
stepwise multiple regression method is that in the former
method the selection of which regressor to include in
the second stage depends on its correlation strength with the
residuals of the model defined in the first stage. Stepwise
regression selection is based on the strength of the regressor
with the response variable. Stagewise regression is said
to be less precise, i.e., larger RMSE; however, it minimizes,
if not eliminates, the effect of correlation among variables.
Simplistically interpreting the individual parameters of the
final model as the relative influence which these have on the
response variable is misleading since the regressors which
enter the model earlier pick up more than their due share at
the expense of those which enter later. Though Draper and
Smith (1981) do not recommend its use for typical problems
since the true OLS is said to provide better overall prediction,
this approach, under certain circumstances, is said to
yield realistic and physically-meaningful estimates of the
individual parameters.
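
The stagewise steps listed above can be sketched in a few lines. This is only an illustration: the function name `stagewise`, the synthetic data, and the choice to fix the number of stages in advance (instead of applying a significance test at each stage) are assumptions.

```python
import numpy as np

def stagewise(X, y, n_stages):
    """Stagewise regression: one simple OLS fit per stage on the running residual."""
    remaining = list(range(X.shape[1]))
    coefs = np.zeros(X.shape[1])
    intercept = 0.0
    resid = y.astype(float).copy()
    order = []
    for _ in range(n_stages):
        # Pick the remaining regressor most strongly correlated with the current residual.
        r = [abs(np.corrcoef(X[:, j], resid)[0, 1]) for j in remaining]
        j = remaining.pop(int(np.argmax(r)))
        slope, icpt = np.polyfit(X[:, j], resid, 1)   # simple OLS of residual vs x_j
        coefs[j] = slope
        intercept += icpt                             # rearranged into standard model form
        resid = resid - (icpt + slope * X[:, j])      # residual becomes the new response
        order.append(j)
    return intercept, coefs, order

# Hypothetical data: x2 is correlated with x1; true model is y = 3*x1 + 1*x2 + noise.
rng = np.random.default_rng(3)
x1 = rng.normal(size=300)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=300)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 1.0 * x2 + 0.1 * rng.normal(size=300)

intercept, coefs, order = stagewise(X, y, n_stages=2)
print("order of entry:", order)
print("stagewise coefficients:", coefs)
```

In this synthetic case the regressor entering first receives a coefficient well above its true value of 3, which illustrates the remark that early regressors pick up more than their due share.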

There are several studies in the published literature which
use the basic concept behind stagewise regression without
explicitly identifying their approach as such. For example, it has
been used in the framework of controlled experiments where
the influence of certain regressors is blocked by conducting
tests during periods when these physical regressors
have no influence (such as doing tests at night in order
to eliminate solar effects). This allows a partial model to be
identified, which is gradually expanded until the parameters
of the full model are identified. For example, Saunders et al. (1994)
modeled the dynamic performance of a house as a 2R1C (two
resistors and one capacitor) electric network with five physical
parameters. A stagewise approach was then adopted to infer
these from controlled tests done during one night only. The
Primary and Secondary Terms Analysis and Renormalization
(PSTAR) method (Subbarao 1988) has a similar objective but
is more versatile and accurate. It uses a detailed forward
simulation model of the building in order to get realistic estimates
of the various thermal flows, and to identify the influential ones
depending on the specific circumstance. In order to keep the
statistical estimation robust, only the most important flows
(usually three or four terms) are then corrected by introducing
renormalization coefficients so that the predicted and measured
thermal performance of the building are as close as possible.
Again, an intrusive experimental protocol allows these
renormalization parameters to be deduced in stages requiring
two nights and one daytime of testing. Thus, the analyst can
effectively select periods when the influence of certain variables
is small-to-negligible, and identify the model parameters in
stages. This allows multi-collinearity effects to be minimized,
and the physical interpretation of the model parameters to be
retained. This approach is illustrated below by way of a case study
involving parameter estimation of a model for the thermal loads
of large commercial buildings.



10.3.6 Case Study of Stagewise Regression
Involving Building Energy Loads

The superiority of stagewise regression as compared to OLS
can be illustrated by a synthetic case study example (Reddy
et al. 1999). Here, the intent was to estimate building and
ventilation parameters of large commercial buildings from
non-intrusive monitoring of their heating and cooling energy
use, from which the net load can be inferred. Since the study
uses synthetic data (i.e., data generated by a commercial
detailed hourly building energy simulation software), one
knows the "correct" values of the parameters in advance,
which allows one to judge the accuracy of the estimation
technique. The procedure involves first deducing a macro-model
for the thermal loads of an ideal one-zone building
suitable for use with monitored data, and then using a
multistage linear regression approach to determine the model
coefficients (along with their standard errors) which can be
finally translated into estimates of the physical parameters
(along with the associated errors). The evaluation was done
for two different building geometries and building mass at
two different climatic locations (Dallas, TX and Minneapolis,
MN) using daily average and/or summed data so as to
remove/minimize dynamic effects.

(a) Model Formulation First, a steady state model for the
total heat gains (QB) was formulated in terms of variables that
can be conveniently monitored. Building internal loads consist
of lights and receptacle loads and occupant loads. Electricity
used by lights and receptacles (qLR) inside a building
can be conveniently measured. Heat gains from occupants,
consisting of both sensible and latent portions, and other types
of latent loads are not amenable to direct measurement
and are, thus, usually estimated. Since the schedule of lights
and equipment closely follows that of building occupancy
(especially at a daily time scale as presumed in this study),
a convenient and logical manner to include the unmonitored
sensible loads was to modify qLR by a constant multiplicative
correction factor ks which accounts for the miscellaneous
(i.e., unmeasurable) internal sensible loads. Also, a simple
manner of treating internal latent loads was to introduce a
constant multiplicative factor kl defined as the ratio of
internal latent load to the total internal sensible load (ks · qLR),
which appears only when outdoor specific humidity wo is
larger than that of the conditioned space. Assuming the sign
convention that energy flows are positive for heat gains and
negative for heat losses, the following model was proposed:

$$ Q_B = q_{LR} k_s (1 + k_l \delta) A + a'_{sol} + (b_{sol} + U A_S + m_v A c_p)(T_o - T_z) + m_v A h_v \delta (w_o - w_z) \qquad (10.23) $$

where

A     Conditioned floor area of building
AS    Surface area of building
cp    Specific heat of air at constant pressure
hv    Heat of vaporization of water
kl    Ratio of internal latent loads to total internal sensible
      loads of building
ks    Multiplicative factor for converting qLR to total internal
      sensible loads
mv    Ventilation air flow rate per unit conditioned area
QB    Building thermal loads
qLR   Monitored electricity use per unit area of lights and
      receptacles inside the building
To    Outdoor air dry-bulb temperature
Tz    Thermostat set point temperature
U     Overall building shell heat loss coefficient
wo    Specific humidity of outdoor air
wz    Specific humidity of air inside space
δ     Indicator variable which is 1 when wo > wz and 0
      otherwise

The effect of solar loads is linearized with outdoor temperature
To and included in the terms a'sol and bsol. The expression
for QB given by Eq. 10.23 includes six physical parameters:
ks, kl, UAS, mv, Tz and wz. One could proceed to estimate
these parameters in several ways.

(b) One-Step Regression Approach One way to identify
these parameters is to directly resort to OLS multiple linear
regression provided monitored data of qLR, To and wo
is available. For such a scheme, it is more appropriate to
combine solar loads into the loss coefficient U and rewrite
Eq. 10.23 as:

$$ Q_B / A = a + b \cdot q_{LR} + c \cdot \delta \cdot q_{LR} + d \cdot T_o + e \cdot \delta \cdot (w_o - w_z) \qquad (10.24a) $$

where the regression coefficients are:

$$ a = -(U A_S / A + m_v c_p) \, T_z, \quad b = k_s, \quad c = k_s k_l, \quad d = (U A_S / A + m_v c_p), \quad e = m_v h_v \qquad (10.24b) $$

Subsequently, the five physical parameters can be inferred
from the regression coefficients as:

$$ k_s = b, \quad k_l = c / b, \quad m_v = e / h_v, \quad U A_S / A = d - e \cdot c_p / h_v, \quad T_z = -a / d \qquad (10.25) $$

The uncertainty associated with these physical parameters 
can be estimated from classical propagation of errors formu- 
lae discussed in Sect. 3.7, and given in Reddy et al. (1999). 
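
The mapping from regression coefficients to physical parameters (Eq. 10.25) is a direct algebraic inversion of Eq. 10.24b. A minimal sketch follows; the function name, the default property values (cp in kJ/kg-K and hv in kJ/kg, typical approximate values for air and water) and the numbers in the round-trip check are illustrative assumptions.

```python
def physical_parameters(a, b, c, d, e, cp=1.006, hv=2450.0):
    """Invert Eq. 10.24b: regression coefficients -> physical parameters (Eq. 10.25)."""
    return {
        "ks": b,                          # sensible load correction factor
        "kl": c / b,                      # latent-to-sensible ratio
        "mv": e / hv,                     # ventilation flow rate per unit area
        "UAs_per_A": d - e * cp / hv,     # shell loss coefficient per unit floor area
        "Tz": -a / d,                     # zone set point temperature
    }

# Round-trip check with hypothetical "true" values pushed through Eq. 10.24b.
ks, kl, mv, ua, tz = 1.2, 0.3, 0.0005, 0.002, 22.0
cp, hv = 1.006, 2450.0
d = ua + mv * cp
a = -d * tz
b = ks
c = ks * kl
e = mv * hv
params = physical_parameters(a, b, c, d, e, cp=cp, hv=hv)
print(params)
```

Because kl, UAS/A and Tz are ratios or differences of coefficients, errors in b, d and e propagate into them, which is why the uncertainty treatment mentioned above matters.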

The "best" value of building specific humidity wz could
be determined by a search method: select the value of wz that
yields the best goodness-of-fit to the data (i.e., highest R² or
lowest CV). Since wz has a more or less well known range of
variation, the search is not particularly difficult. Prior studies
indicated that the goodness-of-fit has a broad optimum for wz in
the range of 0.009-0.011 kg/kg. Thus, the choice of wz was not
a critical issue, and one could simply assume wz = 0.01 kg/kg
without much error in subsequently estimating other parameters.



(c) Two-Step Regression Approach Earlier studies based
on daily data from several buildings in central Texas indicate
that for positive values of (wo - wz) the variables (i) qLR and
To, (ii) qLR and (wo - wz), and (iii) To and (wo - wz) are strongly
correlated, and are likely to introduce bias in the estimation
of parameters from OLS regression. It is the last set of
variables which is usually the primary cause of uncertainty
in the parameter estimation process. Two-step regression
involves separating the data set into two groups depending on
δ being 0 or 1 (with wz assumed to be 0.01 kg/kg). During a
two-month period under conditions of low outdoor humidity,
δ = 0, and Eq. 10.24 reduces to

$$ Q_B / A = a + b \cdot q_{LR} + d \cdot T_o \qquad (10.26) $$

Since qLR and To are usually poorly correlated under such
low outdoor humidity conditions, the coefficients b and d
deduced from multiple linear regression are likely to be
unbiased. For the remaining year-long data when δ = 1,
Eq. 10.24 can be re-written as:

$$ Q_B / A = a + (b + c) \cdot q_{LR} + d \cdot T_o + e \cdot (w_o - w_z) \qquad (10.27) $$

Now, there are two ways of proceeding. One variant is
to use Eq. 10.27 as is, and determine coefficients a, (b + c),
d and e from multiple regression. The previous values of a
and d determined from Eq. 10.26 are rejected, and the
parameter b determined from Eq. 10.26 along with a, c, d and
e determined from Eq. 10.27 are retained for deducing the
physical parameters. This approach, termed 2-step variant A,
may, however, suffer from the collinearity effects between To
and (wo - wz) when Eq. 10.27 is used.

A second variant of the two-step approach, termed 2-step
variant B, would be to retain both coefficients b and d
determined from Eq. 10.26 and use the following modified
equation to determine a, c and e from data when δ = 1:

$$ Q_B / A - d \cdot T_o = a + (b + c) \cdot q_{LR} + e \cdot (w_o - w_z) \qquad (10.28) $$

The collinearity effects between qLR and (wo - wz) when
δ = 1 are usually small, and this is likely to yield less
biased parameter estimates than variant A.
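
The 2-step variant B procedure can be sketched as two successive least-squares fits. The code below is a sketch under assumptions: the helper names, the synthetic data and the "true" coefficient values are hypothetical, and the regressors are generated independently, so the example only demonstrates the mechanics of Eqs. 10.26 and 10.28, not the collinearity behavior of real building data.

```python
import numpy as np

def ols(columns, y):
    """Least-squares fit with an intercept; returns [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]

def two_step_variant_b(q_lr, t_o, w_o, load, w_z=0.01):
    humid = w_o > w_z                     # indicator delta = 1
    dry = ~humid                          # indicator delta = 0
    # Step 1 (Eq. 10.26): b and d from the low-humidity data.
    _, b, d = ols([q_lr[dry], t_o[dry]], load[dry])
    # Step 2 (Eq. 10.28): a, (b + c) and e from the humid data, with d held fixed.
    y2 = load[humid] - d * t_o[humid]
    a, b_plus_c, e = ols([q_lr[humid], w_o[humid] - w_z], y2)
    return {"a": a, "b": b, "c": b_plus_c - b, "d": d, "e": e}

# Hypothetical daily data generated from Eq. 10.24a with known coefficients.
rng = np.random.default_rng(4)
n = 400
q_lr = rng.uniform(0.01, 0.03, n)
t_o = rng.uniform(5.0, 35.0, n)
w_o = rng.uniform(0.004, 0.018, n)
delta = (w_o > 0.01).astype(float)
true = {"a": -0.09, "b": 1.2, "c": 0.36, "d": 0.004, "e": 1.2}
load = (true["a"] + true["b"] * q_lr + true["c"] * delta * q_lr
        + true["d"] * t_o + true["e"] * delta * (w_o - 0.01)
        + 1e-5 * rng.normal(size=n))

est = two_step_variant_b(q_lr, t_o, w_o, load)
print(est)
```

Variant A would differ only in step 2, where To is kept as a regressor and d is re-estimated from the humid data.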

(d) Multi-stage Regression Approach The multi-stage approach
involves four steps, i.e., four regressions are performed
based on Eq. 10.24 as against only two in the two-step approach.
The various steps involved in the multi-stage regression are
shown in Table 10.12. First, (QB/A) is regressed against qLR
only, using a two-parameter (2-P) regression model to determine
coefficient b. Next, the residual Y1 = (QB/A - b · qLR)
is regressed against To using either a two parameter (2-P)
model or a four parameter (4-P) change point model (see
Sect. 5.7.2 for explanation of these terms) to determine
coefficient d. Next, for data when δ = 1, the new residual is
regressed against qLR and (wo - wz) in turn to obtain
coefficients c and e respectively. Note that this procedure does not
allow coefficient a in Eq. 10.24 to be estimated, and so Tz
cannot be identified. This, however, is not a serious limitation
since the range of variation of Tz is fairly narrow for most
commercial buildings. The results of evaluating whether this
identification scheme is superior to the other two schemes
are presented below.

Table 10.12 Steps involved in the multi-stage regression approach using
Eq. 10.24. Days when δ = 1 correspond to days with outdoor humidity
higher than that indoors

        Dependent variable              Regressor   Type of      Parameter identified   Data set used
                                        variable    regression   from Eq. 10.24
Step 1  QB/A                            qLR         2-P          b                      Entire data
Step 2  Y1 = QB/A - b·qLR               To          2-P or 4-P   d                      Entire data
Step 3  Y2 = QB/A - b·qLR - d·To        qLR         2-P          c                      Data when δ = 1
Step 4  Y2 = QB/A - b·qLR - d·To        (wo - wz)   2-P or 4-P   e                      Data when δ = 1
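
The sequence of regressions in Table 10.12 can be sketched as four straight-line (2-P) fits on successive residuals. The code below is illustrative only: function names, synthetic data and coefficient values are hypothetical, the 4-P change point option is not implemented, and no claim about accuracy is implied. In particular, with the independently generated regressors used here, the step 1 slope also absorbs part of the latent term, so b and c are not cleanly separated.

```python
import numpy as np

def slope(x, y):
    """Slope of a two-parameter (2-P) straight-line fit of y against x."""
    return np.polyfit(x, y, 1)[0]

def multi_stage(q_lr, t_o, w_o, load, w_z=0.01):
    humid = w_o > w_z                              # days when delta = 1
    b = slope(q_lr, load)                          # Step 1: entire data
    y1 = load - b * q_lr
    d = slope(t_o, y1)                             # Step 2: entire data
    y2 = y1 - d * t_o
    c = slope(q_lr[humid], y2[humid])              # Step 3: data when delta = 1
    e = slope(w_o[humid] - w_z, y2[humid])         # Step 4: data when delta = 1
    return {"b": b, "c": c, "d": d, "e": e}

# Hypothetical daily data generated from Eq. 10.24a with known coefficients.
rng = np.random.default_rng(5)
n = 2000
q_lr = rng.uniform(0.01, 0.03, n)
t_o = rng.uniform(5.0, 35.0, n)
w_o = rng.uniform(0.004, 0.018, n)
delta = (w_o > 0.01).astype(float)
load = (-0.09 + 1.2 * q_lr + 0.36 * delta * q_lr + 0.004 * t_o
        + 1.2 * delta * (w_o - 0.01) + 1e-5 * rng.normal(size=n))

est = multi_stage(q_lr, t_o, w_o, load)
print(est)
```

As noted in the text, the intercept a is not estimated by this procedure, so Tz cannot be recovered from these four coefficients.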



(e) Evaluation A summary of how accurately the various 
parameter identification schemes (one-step, two-step variant 
A, two-step variant B, and the multistage procedures) are 
able to identify or recover the "true" parameters is shown in 
Fig. 10.11. Note that simulation runs R1-R5 contain the in- 
fluence of solar radiation on building loads, while the effect 
of this variable has been "disabled" in the remaining four 
computer simulation runs. The "true" values of each of the 
four parameters are indicated by a solid line, while the esti- 
mated parameters along with their standard errors are shown 



Run   Identif. scheme   Location   Building   Solar (?)
R1    1-step            Dallas     B1         Yes
R2    2-A               Dallas     B1         Yes
R3    2-B               Dallas     B1         Yes
R4    Multi.            Dallas     B1         Yes
R5    Multi.            Dallas     B1         Yes
R6    Multi.            Dallas     B1         No
R7    Multi.            Minneap.   B1         No
R8    Multi.            Dallas     B2         No
R9    Multi.            Minneap.   B2         No

[Figure panels: estimated value of each parameter for runs R1-R9; legible axis labels include $m_v$ (kg/s·m²) and $UA_s/A$ (kW/°C·m²)]

Fig. 10.11 Comparison of how the various estimation schemes (R1-R9) were able to recover the "true" values of the four physical parameters of the model given by Eq. 10.24. The solid lines depict the correct value while the mean values estimated for the various parameters and their 95% uncertainty bands are also shown
as small boxes. It is obvious that parameter identification is very poor for the one-step and two-step procedures (R1, R2 and R3) while, except for $(UA_s/A)$, the three other parameters are very accurately identified by the multistep procedure (R4). Also noteworthy is the fact that the single-step regression fits to daily $Q_B$ values for buildings B1 and B2 at Dallas and Minneapolis are excellent ($R^2$ in the range of 0.97-0.99). So a good fit by itself does not assure accurate parameter identification, but seems to be a necessary condition for being able to do so.

The remaining runs (R6-R9) do not include solar effects, and in such cases the multistep parameter identification scheme is accurate for both climatic types (Dallas and Minneapolis) and building geometries (B1 and B2). From Fig. 10.11, it is seen that though there is no bias in estimating the parameter $m_v$, there is larger uncertainty associated with this parameter than with the other three parameters. Finally, note that the bias in identifying $(UA_s/A)$ using the multistep approach when solar is present (R4 and R5) is not really an error: it simply means that the steady-state overall heat loss coefficient has to be "modified" in order to implicitly account for solar interactions with the building envelope.

A physical explanation as to why the multistage identification scheme is superior to the other schemes (especially the two-step scheme) has to do with the cross-correlation of the regressor variables. Table 10.13 presents the correlation coefficients of the various variables, as well as variables $Y_1$ and $Y_2$ (see Table 10.12). Note that for both locations, $q_{LR}$, because of the finite number of schedules under which the building is operated (5 day-types in this case, such as weekdays, weekends, holidays, ...), is the variable least correlated with $Q_B$ as well as with the other regressor variables. Hence,



Table 10.13 Correlation coefficient matrix of various parameters for the two cities selected at the daily time scale for Runs #6 and #7 (R6 and R7). Values above the diagonal are for Dallas; values below the diagonal are for Minneapolis

              Q_B,1-zone   Y_1     Y_2     q_LR    T_o     w_o     q_LR·δ
Q_B,1-zone        -        0.85    0.52    0.53    0.88    0.72    0.82
Y_1              0.88       -      0.78    0.00    0.97    0.80    0.70
Y_2             -0.86     -0.82     -     -0.27    0.59    0.68    0.44
q_LR             0.48      0.01   -0.30     -      0.11    0.07    0.42
T_o              0.91      0.97   -0.93    0.13     -      0.75    0.72
w_o              0.57      0.59   -0.40    0.10    0.54     -      0.66
q_LR·δ           0.66      0.60   -0.48    0.27    0.58    0.72     -



regressing $Q_B$ with $q_{LR}$ is least likely to result in the regression coefficient of $q_{LR}$ (i.e., b in Eq. 10.24) picking up the influence of other regressor variables, i.e., the bias in the estimation of b is likely to be minimized. Had one adopted a scheme of regressing $Q_B$ with $T_o$ first, the correlation between $Q_B$ and $T_o$ as well as between $T_o$ and $(w_o - w_z)$ for data when $\delta = 1$ would result in coefficient d of Eq. 10.24 being assigned more than its due share of importance, thereby leading to a bias in the $UA_s$ value (see R1, R2 and R3 in Fig. 10.11), and, thus, underestimating $k_l$.

The regression of $Q_B$ versus $q_{LR}$ for R6 is shown in Fig. 10.12a. The second step involves regressing the residual $Y_1$ versus $T_o$ because of the very strong correlation between both variables (correlation coefficients of about 0.97-0.99, see Table 10.13). Equally good results were obtained by using step 3 and step 4 (see Table 10.12) in any order. Step 3 (see Fig. 10.12c) allows identification of the regression coefficient c in Eq. 10.24 representing the building internal latent load, while step 4 (i.e., coefficient e of Eq. 10.24) identifies the corresponding regression coefficient associated with outdoor humidity (Fig. 10.12d). In conclusion, this case study illustrates how a multistage identification scheme has the potential to yield accurate parameter estimates by removing much of the bias introduced in the multiple linear regression approach with correlated regressor variables.

Fig. 10.12 Different steps in the stagewise regression approach to estimate the four model parameters "b, c, d, e" following Eq. 10.24 as described in Table 10.12. a. Estimation of parameter b. b. Estimation of parameter d. c. Estimation of parameter c. d. Estimation of parameter e



10.3.7 Other Methods 

Factor analysis (Manly 2005) is similar to PCA in that a reduction in dimensionality is sought by removing the redundancy from a set of correlated regressors. However, the difference in this approach is that each of the original regressors is now reformulated in terms of a small number of common factors which impact all of the variables, and a set of errors or specific factors which affect only a single X variable. Thus, the regressor is modeled as:

$$X_i = a_{i1}F_1 + a_{i2}F_2 + a_{i3}F_3 + \ldots + e_i$$

where $a_{ij}$ are called the factor loadings, $F_j$ is the value of the j-th common factor, and $e_i$ is the part specific to the i-th variable. Factor rotation can be orthogonal, or oblique to yield correlated factors. One can use PCA to set an initial estimate for the number of factors (equal to the number of eigenvalues which are greater than one), and drop factors which exhibit little or no explicative power. Thus, while PCA is not based on a model, factor analysis presumes a model where the data is made up of common factors. Factor analysis is frequently used for identifying hidden or latent trends or structure in the data whose effects cannot be directly measured. As with PCA, several authors are skeptical of factor analysis since it is somewhat subjective; however, others point out its descriptive benefit as a means of understanding the causal structure of multivariate data.

Canonical correlation analysis is an extension of multiple linear regression (MLR) for systems which have several response variables, i.e., where a vector of responses Y is to be determined. Both the regressor set (X) and the response set (Y) are first standardized, and then represented by weighted linear combinations U and V respectively, which are finally regressed against each other (akin to MLR) in order to yield canonical weights (or derived model parameters) which can be ranked. These canonical weights can be interpreted as beta coefficients (see Sect. 5.4.5) in that they yield insights into the relative contributions of the individual derived variables. Thus, the approach allows one to understand the relationship between and within two sets of variables X and Y. However, in many systems, the response variables Y are not all "equal" in importance; some may be deemed to be more influential than others based on physical insights of system behavior. This relative physical importance is not retained during the rotation since it is based purely on statistical criteria. Thus, this approach is said to be more relevant to the social and softer sciences than to engineering and the hard sciences (though it is widely used in hydro-climatology).

PCA and factor analyses are basically multivariate data analysis tools which involve analyzing the X'X matrix only; application to regression model building is secondary, if at all. Canonical regression uses both X'X and Y'Y, but still the regression is done only after the initial transformations are completed. A more versatile and flexible model identification approach which considers the covariance structure between predictor and response variables during model identification is called partial least squares (PLS) regression. It uses the cross-product matrix Y'XX'Y to identify the multivariate model. It has been used in instances when there are fewer observations than predictor variables, and further is useful for exploratory analysis and for outlier detection. PLS has found applications in numerous disciplines where a large number of predictors are used, such as economics, medicine, chemistry and psychology. Note that PLS is not really meant to uncover the underlying structure of multivariate data (as do PCA and factor analysis), but rather to serve as an accurate tool for predicting system response.



10.4 Non-OLS Parameter Estimation Methods

10.4.1 General Overview 

The general parameter estimation problem is discussed below, and some of the more commonly used methods are pointed out. The approach adopted in OLS was to minimize an objective function (also referred to as the loss function) expressed as the sum of the squared residuals (given by Eq. 5.3). One was able to derive closed form solutions for the model parameters and their uncertainties as well as the standard errors of regression (for mean response) and those for prediction (for individual response) under certain simplifying assumptions as to how noise corrupts measured system performance. Such closed form solutions cannot be obtained for many situations where the function to be minimized has to be framed differently, and these require the adoption of search methods. Thus, parameter estimation problems are, in essence, optimization problems where the objective function is framed in accordance with what one knows about the errors in the measurements and in the model structure.

Fig. 10.13 a. Sketch and nomenclature to explain the different types of parameter estimation problems for the simple case of one exploratory variable (x) and one response variable (y). Noise and errors are assumed to be additive. b. The idealized case without any errors or noise

A model can be considered to be either linear or non-linear in its parameters. The latter situation is covered in Sect. 10.5, while here the discussion is limited to linear estimation problems. Figure 10.13 is a sketch depicting how errors influence the parameter estimation process. Errors which can corrupt the process can be viewed as either additive, multiplicative or mixed. For the simplest case, namely the additive error situation, such errors can appear as measurement error ($\gamma$) on the regressor/exploratory variable, on the response variable ($\delta$), and also on the postulated model ($\varepsilon$). Note from Fig. 10.13 that $x^*$ and $y^*$ are the true values of the variables while $x_i$ and $y_i$ are the measured values at observation i. Even when errors are assumed additive, one can distinguish between three broad types of situations:
(a) Measurement errors $(\gamma_i, \delta_i)$ and model error $(\varepsilon_i)$ may be: (i) unbiased or biased, (ii) normally or non-normally distributed along different vertical slices of $x_i$ values (see Fig. 5.3), (iii) of variance that is zero, uniform, or non-uniform over the range of variation of x;
(b) Covariance effects may, or may not, exist between the errors and the regressor variable, i.e., between $(x, \gamma, \delta, \varepsilon)$;
(c) Autocorrelation or serial correlation over different temporal lags may, or may not, exist between the errors and the regressor variables, i.e., between $(x, \gamma, \delta, \varepsilon)$.

The reader is urged to refer back to Sect. 5.5.1, where these conditions as applicable to OLS were stated. The OLS situation strictly applies when there is no error in x, i.e., $\gamma = 0$; when $\delta \sim N(0, \sigma_\delta^2)$, i.e., unbiased and normally distributed with constant variance; when $\mathrm{cov}(x_i, \delta_i) = 0$; and when $\mathrm{cov}(\delta_i, \delta_{i+1}) = 0$. Notice that it is impossible to separate the effects of $\delta_i$ from $\varepsilon_i$ in the OLS case, and so the combined effect is to increase the error variance of the following model, which will be reflected in the RMSE value:

$$y = a + bx + (\varepsilon + \delta) \qquad (10.29)$$



Situations when the error in the x variable, i.e., $\gamma \sim N(0, \sigma_\gamma^2)$, is non-negligible would need the error in variable (EIV) approach presented in Sect. 10.4.2. For such cases as when $\delta$ is a known function, maximum likelihood estimation (MLE) is relevant and is discussed in Sect. 10.4.3.



10.4.2 Error in Variables (EIV) and Corrected Least Squares

Parameter estimation using OLS is based on the premise that the regressors are known without error ($\gamma = 0$), or that their errors are very small compared to those of the response variable. There are situations when the above conditions are invalid (for example, when the variable x is a function of several basic measurements, each with its own measurement error), and this is when the error in variable (EIV) approach is appropriate, whereas OLS results in biased estimates. If the error variances of the measurement errors are known, the bias of the OLS estimator can be removed and a consistent estimator, called Corrected Least Squares (CLS), can be derived. This is illustrated in Fig. 10.14, taken from Andersen and Reddy (2002), which shows how the model parameter estimated using OLS gradually increases in bias as more noise is introduced in the x variable, while no such bias is seen for CLS estimates. However, note that the 95% uncertainty bands by both methods are about the same.

CLS accounts for both the uncertainty in the regressor variables as well as that in the dependent variable by minimizing a distance weighted by the ratio of these two uncertainties. Let us consider the case of the simple linear regression model $y = \alpha + \beta x$, and assume that the errors in the dependent variable and the errors in the independent variable are uncorrelated. The basis of the minimization scheme is described graphically by Fig. 10.15. Let us define a parameter representing the relative weights of the variance of measurements of x and y as:

$$\lambda = \frac{\mathrm{var}(\gamma)}{\mathrm{var}(\delta)} \qquad (10.30)$$

The loss function assumes the form (Mandel 1964):

$$S = \sum_{i=1}^{n}\left[(x_i - \hat{x}_i)^2 + \lambda\,(y_i - \hat{y}_i)^2\right]$$

subject to the condition that $\hat{y}_i = a + b\hat{x}_i$, where $(\hat{y}_i, \hat{x}_i)$ are the estimates of $(y_i, x_i)$. Omitting the derivation, the CLS estimates turn out to be:
Fig. 10.14 Figure depicting how a physical parameter in a model (in this case the heat exchanger thermal resistance R of a chiller) estimated using OLS becomes gradually more biased as noise is introduced in the x variable. No such bias is seen when EIV estimation is adopted. The uncertainty bands by both methods are about the same as the magnitude of the error is increased. (From Andersen and Reddy 2002)

Fig. 10.15 Differences in both slope and the intercept of a linear model when the parameters are estimated under OLS or EIV. Points shown as "*" denote data points and the solid lines are the estimates of the two models. The dotted lines, which differ by angle $\theta$, indicate the distances whose squared sum is being minimized in both approaches. (From Andersen and Reddy 2002)
Legend: regression line estimated by the EIV method; regression line estimated by the OLS method; distance being minimized in OLS regression (only the uncertainty in Y is accounted for); distance being minimized in EIV regression (uncertainty in both X and Y is accounted for)



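$$b = \frac{(\lambda S_{yy} - S_{xx}) + \left[(\lambda S_{yy} - S_{xx})^2 + 4\lambda S_{xy}^2\right]^{1/2}}{2\lambda S_{xy}} \qquad (10.31a)$$

and

$$a = \bar{y} - b\,\bar{x} \qquad (10.32b)$$

where $S_{xx}$, $S_{yy}$ and $S_{xy}$ are defined by Eq. 5.6.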

The extension of the above equation to the multivariate case is given by Fuller (1987):

$$b_{CLS} = (X'X - S^2_{xx})^{-1}(X'Y - S^2_{xy}) \qquad (10.32c)$$

where $S^2_{xx}$ is a p × p matrix with the covariance of the measurement errors and $S^2_{xy}$ is a p × 1 vector with the covariance between the regressor variables and the dependent variable (given by Eq. 5.6).

A simple conceptual explanation is that Eq. 10.31 performs on the estimator matrix an effect essentially the opposite of what ridge regression does. While ridge regression "jiggles" or randomly enhances the dispersion in the numerical values of the X variables in order to reduce the adverse effect of multi-collinearity on the estimated parameter bias, CLS tightens the variation in an attempt to reduce the effect of random error on the x variables. For the simple regression model, the EIV approach recommends that minimization of errors be done following angle $\theta$ (see Fig. 10.15) given by:

$$\tan(\theta) = \frac{b}{(s_y/s_x)} \qquad (10.33)$$
Fig. 10.16 Plot of Eq. 10.33 with b = 1 (plot title: Simple Linear Regression with Errors in X; horizontal axis: Stdev of Y / Stdev of X)

where s is the standard deviation.

For the case of b = 1, one gets a curve such as Fig. 10.16. Note that when the ratio of the standard deviations is less than about 5, the angle $\theta$ varies little and is about 10°.

As a rough rule of thumb, it is advocated that if the measurement uncertainty of the x variable, characterized by its standard deviation, is much less (say, less than 1/5th) than that of the response variable, then there is little benefit in applying EIV regression as compared to OLS. The 1/5th rule has been suggested for the simple regression case, and should not be used for multiple regression with correlated regressors. The interested reader can refer to Beck and Arnold (1977) and to Fuller (1987) for more in-depth treatment of the EIV approach.
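The attenuation of the OLS slope under noise in x, and its removal by the CLS slope formula, can be illustrated with a small simulation. The following is a minimal sketch and not the procedure of Andersen and Reddy (2002): the true model, the noise levels and the sample size are all hypothetical, the error variances are assumed known so that $\lambda$ can be formed, and the slope expression is the one with denominator $2\lambda S_{xy}$.

```python
import math
import random


def sums_of_squares(x, y):
    """Return the means and the centered sums S_xx, S_yy, S_xy."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return xbar, ybar, sxx, syy, sxy


def ols_fit(x, y):
    """Ordinary least squares: only the error in y is accounted for."""
    xbar, ybar, sxx, _, sxy = sums_of_squares(x, y)
    b = sxy / sxx
    return ybar - b * xbar, b


def cls_fit(x, y, lam):
    """Corrected least squares with lam = var(x error) / var(y error)."""
    xbar, ybar, sxx, syy, sxy = sums_of_squares(x, y)
    d = lam * syy - sxx
    b = (d + math.sqrt(d * d + 4.0 * lam * sxy * sxy)) / (2.0 * lam * sxy)
    return ybar - b * xbar, b


# Hypothetical synthetic data: true line y = 1 + 2x, noise on both x and y.
rng = random.Random(0)
TRUE_A, TRUE_B = 1.0, 2.0
SD_X_ERR, SD_Y_ERR = 1.5, 1.0
x_true = [rng.uniform(0.0, 10.0) for _ in range(2000)]
x_obs = [xt + rng.gauss(0.0, SD_X_ERR) for xt in x_true]
y_obs = [TRUE_A + TRUE_B * xt + rng.gauss(0.0, SD_Y_ERR) for xt in x_true]

lam = SD_X_ERR ** 2 / SD_Y_ERR ** 2  # assumed known, as CLS requires
a_ols, b_ols = ols_fit(x_obs, y_obs)
a_cls, b_cls = cls_fit(x_obs, y_obs, lam)
print(f"OLS slope = {b_ols:.3f}, CLS slope = {b_cls:.3f}, true slope = {TRUE_B}")
```

With noise of this size on x, the OLS slope is expected to be biased toward zero while the CLS slope stays close to the true value; shrinking `SD_X_ERR` makes the two estimates converge, in line with the rule of thumb above.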



10.4.3 Maximum Likelihood Estimation (MLE) 

The various estimation methods discussed earlier, to deduce point estimates of a sample or model parameters from a data set of observations, are based on moments of the data (the first moment is the mean, the second is the variance, and so on); hence, this approach is referred to as the Method of Moments Estimation (MME). Maximum Likelihood Estimation (MLE) is another approach, and is generally superior to MME since it can handle any type of error distribution, while MME is limited to normally distributed errors (Pindyck and Rubinfeld 1981; Devore and Farnum 2005). MLE allows generation of estimators of unknown parameters that are generally more efficient and consistent than MME, though sometimes estimates can be biased. In many situations, the assumption of normally distributed errors is reasonable, and in such cases, MLE and MME give identical results. Thus, in that regard, MME can be viewed as a special (but important) case of MLE.

Consider the following simple example meant to illustrate the concept of MLE. Suppose that a shipment of computers is sampled for quality, and that two out of five are found defective. The common sense approach of estimating the population proportion of defectives is $\pi = 2/5 = 0.40$. One could use an alternative method by considering a whole range of possible $\pi$ values. For example, if $\pi = 0.1$, then the probability of s = 2 defectives out of a sample of n = 5 observations would be given by the binomial formula:

$$\binom{n}{s}\pi^s(1-\pi)^{n-s} = \binom{5}{2}(0.1)^2(0.9)^3 = 0.0729$$

In other words, if $\pi = 0.1$, there is only a 7.3% probability of getting the actual sample that was observed. However, if $\pi = 0.2$, the chances improve since one gets p = 20.5%. By trying out various values, one can determine the best value of $\pi$. How the maximum likelihood function varies for different values of $\pi$ is shown in Fig. 10.17, from which one finds that the most likely value is 0.4, the same as the common sense approach. Thus, the MLE approach is to simply determine the value of $\pi$ which maximizes the likelihood function given by the above binomial formula. In other words, the MLE is the population value that is more likely than any other value to generate the sample which was actually observed, or which maximizes the likelihood of the observed sample.

Fig. 10.17 The maximum likelihood function for the case when two computers out of a sample of five are found defective (horizontal axis: proportion of defectives)
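The trial-and-error search described above is easy to reproduce. The sketch below simply evaluates the binomial likelihood on a grid of candidate proportions and picks the largest; the grid spacing of 0.01 is an arbitrary choice made for illustration.

```python
from math import comb


def binomial_likelihood(pi, s=2, n=5):
    """Probability of observing s defectives in n items if the true proportion is pi."""
    return comb(n, s) * pi ** s * (1.0 - pi) ** (n - s)


# Candidate proportions 0.01, 0.02, ..., 0.99 (grid spacing is an assumption).
grid = [i / 100 for i in range(1, 100)]
best_pi = max(grid, key=binomial_likelihood)

print(f"L(0.1) = {binomial_likelihood(0.1):.4f}")
print(f"L(0.2) = {binomial_likelihood(0.2):.4f}")
print(f"most likely proportion on the grid: {best_pi}")
```

The two printed likelihoods correspond to the 7.3% and 20.5% values quoted in the text, and the grid maximum falls at the sample proportion s/n.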

The above approach can be generalized as follows. Suppose a sample $(x_1, x_2, \ldots, x_n)$ of independent observations is drawn from a population with probability function $p(x_i/\theta)$ where $\theta$ is the unknown population parameter to be estimated. If the sample is random, then the joint probability function for the whole sample is:

$$p(x_1, x_2, \ldots, x_n/\theta) = p(x_1/\theta)\cdot p(x_2/\theta)\cdots p(x_n/\theta) \qquad (10.34)$$

The objective is now to estimate the most likely value of $\theta$ among all its possible values which maximizes the above probability function. The likelihood function is thus:

$$L(\theta/x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i/\theta) \qquad (10.35a)$$

where $\prod$ denotes the product of n factors.

The parameter $\theta$ is easily determined by taking natural logarithms, in which case,

$$\ln(L(\theta)) = \sum_{i=1}^{n} \ln p(x_i/\theta) \qquad (10.35b)$$

Thus, one could determine the MLE of $\theta$ by performing a least-squares regression in case the probability function is known or assumed. Say one wishes to estimate the two regression parameters $\beta_0$ and $\beta_1$ of a simple linear model assuming the error distribution to be normal with variance $\sigma^2$ (Sect. 2.4.3a); the probability distribution of the residuals would be given by:

$$p(y_i) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left[-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right] \qquad (10.36)$$

where $(x_i, y_i)$ are the individual sample observations. The maximum likelihood function is then:

$$L(y_1, y_2, \ldots, y_n, \beta_0, \beta_1, \sigma^2) = p(y_1)\cdot p(y_2)\cdots p(y_n) = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left[-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right] \qquad (10.37)$$



The three unknown parameters $(\beta_0, \beta_1, \sigma)$ can be determined either analytically (by setting the partial derivatives to zero) or numerically, which can be done by most statistical software programs. For this case, it can be shown that the MLE estimates and OLS estimates of $\beta_0$ and $\beta_1$ are identical, while that for $\sigma^2$ is biased (though consistent).
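The equivalence with OLS, and the bias in the variance estimate, can be checked numerically. The sketch below uses a small hypothetical data set; it relies on the standard closed-form results that, for normal errors, the log-likelihood is maximized by the OLS coefficients together with $\sigma^2 = SSE/n$, whereas the unbiased variance estimate divides by n - 2.

```python
import math


def log_likelihood(b0, b1, sigma2, x, y):
    """Log of the normal likelihood for a straight-line model with error variance sigma2."""
    n = len(x)
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - sse / (2.0 * sigma2)


def ols(x, y):
    """Closed-form OLS intercept and slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


# Hypothetical observations, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(x)

b0, b1 = ols(x, y)
sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
sigma2_mle = sse / n             # maximizes the likelihood, biased low
sigma2_unbiased = sse / (n - 2)  # usual regression estimate

ll_best = log_likelihood(b0, b1, sigma2_mle, x, y)
ll_shifted = log_likelihood(b0 + 0.05, b1 - 0.02, sigma2_mle, x, y)
print(f"log-likelihood at OLS coefficients: {ll_best:.4f}")
print(f"log-likelihood at perturbed coefficients: {ll_shifted:.4f}")
print(f"sigma^2 (MLE) = {sigma2_mle:.5f}, sigma^2 (unbiased) = {sigma2_unbiased:.5f}")
```

Any perturbation of the coefficients away from the OLS values lowers the log-likelihood, and the ratio of the two variance estimates is (n - 2)/n, which tends to one as the sample grows.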

The advantages of MLE go beyond its obvious intuitive appeal:

(a) though biased for small samples, the bias reduces as the sample size increases,

(b) where MLE is not the same as MME, the former is generally superior in terms of yielding minimum variance,

(c) MLE is very straightforward and can be easily solved by computers,

(d) in addition to providing estimates, MLE is useful to show the range of plausible values for the parameters, and also for deducing confidence limits.

The main drawback is that MLE may lack robustness in dealing with a population of unknown shape, i.e., it cannot be used when one has no knowledge of the underlying error distribution. In such cases, one can evaluate the goodness of fit of different probability distributions using the Chi-square criterion (see Sect. 4.2.6), identify the best candidates and pick one based on some prior physical insights. The computation is easily done on a computer, but the final selection is to some extent at the discretion of the analyst. MLE is often used to estimate parameters appearing in probability



Table 10.14 Data table for Example 10.4.1

2,100   2,107   2,128   2,138
2,167   2,374   2,412   2,435
2,438   2,456   2,596   2,692
2,738   2,985   2,996   3,369



distribution functions when sample data is available. The following examples illustrate the approach.

Example 10.4.1: MLE for exponential distribution
The lifetime of several products and appliances can be described by the exponential distribution (see Sect. 2.4.3e) given by the following one-parameter model:

$$E(x;\lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$

Sixteen appliances have been tested and operating life data in hours are assembled in Table 10.14. The parameter $\lambda$ is to be estimated using MLE.

The likelihood function is

$$L(\lambda/x_i) = \prod_{i=1}^{n} p(x_i/\lambda) = (\lambda e^{-\lambda x_1})(\lambda e^{-\lambda x_2})(\lambda e^{-\lambda x_3})\cdots = \lambda^n e^{-\lambda \sum x_i}$$

Taking logs, one gets $\ln(L(\lambda)) = n\ln(\lambda) - \lambda \sum x_i$. Differentiating with respect to $\lambda$ and setting it to zero yields:

$$\frac{d\ln(L(\lambda))}{d\lambda} = \frac{n}{\lambda} - \sum x_i = 0, \quad \text{from which} \quad \lambda = \frac{n}{\sum x_i} = \frac{1}{\bar{x}}$$

Thus the MLE estimate of the parameter is $\lambda = (2508.188)^{-1} = 0.000399$. ■
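The arithmetic of Example 10.4.1 can be verified in a few lines. This sketch uses the sixteen lifetimes of Table 10.14 and the closed-form result $\lambda = n/\sum x_i$; the log-likelihood function is included only to confirm that nearby values of $\lambda$ do worse.

```python
import math

# Operating lives in hours, from Table 10.14.
lifetimes = [2100, 2107, 2128, 2138, 2167, 2374, 2412, 2435,
             2438, 2456, 2596, 2692, 2738, 2985, 2996, 3369]


def exp_log_likelihood(lam, data):
    """ln L(lambda) = n ln(lambda) - lambda * sum(x_i) for the exponential model."""
    return len(data) * math.log(lam) - lam * sum(data)


n = len(lifetimes)
mean_life = sum(lifetimes) / n
lam_hat = n / sum(lifetimes)  # closed-form MLE, equal to 1 / mean

print(f"mean life = {mean_life:.3f} h, lambda_hat = {lam_hat:.6f} per h")
```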

Example 10.4.2: MLE for Weibull distribution
The observations assembled in Table 10.15 are values of wind speed (in m/s) at a certain location. The Weibull distribution (see Sect. 2.4.3f) with parameters $(\alpha, \beta)$ is appropriate for modeling wind distributions:

$$p(x) = \frac{\alpha}{\beta^{\alpha}}\, x^{\alpha-1} e^{-(x/\beta)^{\alpha}}$$

Estimate the values of the two parameters using MLE.

Table 10.15 Data table for Example 10.4.2

4.7    5.8    6.5    6.9    7.2    7.4
7.7    7.9    8.0    8.1    8.2    8.4
8.6    8.9    9.1    9.5    10.1   10.4

As previously, taking the partial derivatives of $\ln(L(\alpha, \beta))$, setting them to zero, and solving the two equations results in (Devore and Farnum 2005):

$$\alpha = \left[\frac{\sum x_i^{\alpha}\ln(x_i)}{\sum x_i^{\alpha}} - \frac{\sum \ln(x_i)}{n}\right]^{-1} \quad \text{and} \quad \beta = \left(\frac{\sum x_i^{\alpha}}{n}\right)^{1/\alpha}$$

Solving these equations by hand is tedious and error-prone, and so one would tend to use a computer program to perform MLE. Resorting to this option resulted in MLE parameter estimates of ($\alpha$ = 7.9686, $\beta$ = 0.833). The goodness-of-fit of the model can be evaluated using the Chi-square distribution (see Sect. 2.4.10), which is found to be 0.222. The resulting plot and the associated histogram of observations are jointly shown in Fig. 10.18. ■

Fig. 10.18 Fit to data of the Weibull distribution with MLE parameter estimation (Example 10.4.2)

Fig. 10.19 Exponential and logistic growth curves for annual worldwide primary energy use assuming an initial annual growth rate of 2.4% and under two different carrying capacity values k (Example 10.4.3). Horizontal axis: years from 2008
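Because the shape parameter appears on both sides of its estimating equation, the Weibull MLE has to be found iteratively. The sketch below is one possible way to do so and is not the program used in the text: it solves the standard shape equation by bisection (the bracketing interval is an assumption) and then evaluates the scale parameter, with $\alpha$ taken as shape and $\beta$ as scale. The printed values have not been checked against those quoted in the example, which may follow a different parameterization.

```python
import math

# Wind speeds in m/s, from Table 10.15.
wind = [4.7, 5.8, 6.5, 6.9, 7.2, 7.4, 7.7, 7.9, 8.0,
        8.1, 8.2, 8.4, 8.6, 8.9, 9.1, 9.5, 10.1, 10.4]


def shape_equation(alpha, x):
    """Residual of the MLE shape equation; its root is the estimate of alpha."""
    s_pow = sum(xi ** alpha for xi in x)
    s_pow_log = sum(xi ** alpha * math.log(xi) for xi in x)
    mean_log = sum(math.log(xi) for xi in x) / len(x)
    return s_pow_log / s_pow - 1.0 / alpha - mean_log


def weibull_mle(x, lo=0.1, hi=50.0, tol=1e-10):
    """Bisection on the shape equation, then closed-form scale.

    The bracket [lo, hi] is an assumption; the residual is negative at lo
    and positive at hi for the data used here.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if shape_equation(mid, x) > 0.0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    alpha = 0.5 * (lo + hi)
    beta = (sum(xi ** alpha for xi in x) / len(x)) ** (1.0 / alpha)
    return alpha, beta


alpha_hat, beta_hat = weibull_mle(wind)
print(f"shape alpha = {alpha_hat:.4f}, scale beta = {beta_hat:.4f}")
```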



10.4.4 Logistic Functions

The exponential model was described in Sect. 2.4.3e in terms of modeling unrestricted growth. Logistic models are extensions of the exponential model in that they apply to instances where growth is initially unrestricted, but gradually changes to restricted growth as resources get scarcer. They are an important class of equations which appear in several fields for modeling various types of growth: that of populations (humans, animal and biological) as well as energy and material use patterns (Draper and Smith 1981; Masters and Ela 2008). The non-linear shape of these models is captured by an S curve (called a sigmoid function) that reaches a steady-state value (see Fig. 10.19). One can note two phases: (i) an early phase during which the environmental conditions are optimal and allow the rate of growth to be exponential, and (ii) a second phase where the rate of growth is restricted by the amount of growth yet to be achieved, and assumed to be directly proportional to this amount. The following model captures this behavior of population N over time t:

$$\frac{dN}{dt} = rN\left(1 - \frac{N}{k}\right), \qquad N(0) = N_0 \qquad (10.38)$$

where N(0) is the population at time t = 0, r is the growth rate constant, and k is the carrying capacity of the environment. The factor k can be constant or time varying; an example of the latter is the observed time variant periodic behavior of predator-prey populations in closed ecosystems. The factor [(1 - N/k)] is referred to as the environmental factor. One way of describing this behavior in the context of biological organisms is to assume that during the early phase, food is used for both growth and sustenance, while at the saturation level it is used for sustenance only since growth has stopped. The solution to Eq. 10.38 is:

$$N(t) = \frac{k}{1 + \exp[-r(t - t^*)]} \qquad (10.39a)$$

where t* is the time at which N = k/2, and is given by:

$$t^* = \frac{1}{r}\ln\left(\frac{k}{N_0} - 1\right) \qquad (10.40)$$

If the instantaneous growth rate at t = 0 is represented by $R(0) = R_0$, then it can be shown that

$$r = \frac{R_0}{1 - \frac{N_0}{k}}$$

and Eq. 10.39a can be rewritten as:

$$N(t) = \frac{k}{1 + \left(\frac{k}{N_0} - 1\right)\exp(-R_0 t)} \qquad (10.39b)$$

Thus, knowing the quantities $N_0$ and $R_0$ at the start, when t = 0, allows r, t* and then N(t) to be determined.

Another useful concept in population biology is the concept of maximum sustainable yield of an ecosystem (Masters and Ela 2008). This corresponds to the maximum removal rate which can sustain the existing population, and would occur when $d^2N/dt^2 = 0$, i.e., from Eq. 10.38, when N = k/2. Thus, if the fish population in a certain pond follows the logistic growth curve when there is no fish harvesting, then the maximum rate of fishing would be achieved when the actual fish population is maintained at half its carrying capacity. Many refinements to the basic logistic growth model have been proposed which allow such factors as fertility and mortality rates, population age composition and migration rates to be considered.

Example 10.4.3: Use of logistic models for predicting growth of worldwide energy use
The primary energy use in the world in 2008 was about 14 terawatts (TW). The annual growth rate is close to 2.4%.

(a) If the energy growth is taken to be exponential (implying unrestricted growth), in how many years would the energy use double?
The exponential model is given by $Q(t) = Q_0 \exp(R_0 t)$ where $R_0$ is the growth rate (= 0.024) and $Q_0$ (= 14 TW) is the annual energy use at the start, i.e., for year 2008. The doubling time would occur when $Q(t) = 2Q_0$, and by simple algebra: $t_{doubling} = \frac{\ln 2}{R_0} = \frac{0.693}{0.024} = 28.9$ years, or at about year 2037.

(b) If the growth is assumed to be logistic with a carrying capacity of k = 45 TW (i.e., the value of annual energy use is likely to stabilize at this value in the far future), determine the annual energy use for year 2037. With t = 28.9 and $R_0$ = 0.024, Eq. 10.39b yields:

$$Q(t) = \frac{45}{1 + \left(\frac{45}{14} - 1\right)\exp[-(0.024)(28.9)]} = 21.36 \text{ TW}$$

which is (as expected) less than the 28 TW predicted by the exponential model. The plots in Fig. 10.19, generated using a spreadsheet, illustrate logistic growth curves for two different values of k with the exponential curve also drawn for comparison. The asymptotic behavior of the logistic curves and the fact that the curves start deviating from each other quite early on are noteworthy points. ■
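The two calculations of Example 10.4.3 can be reproduced as follows. This is a minimal sketch that evaluates Eq. 10.39b exactly as it is written in the text, i.e., with $R_0$ in the exponent; the function names are illustrative only.

```python
import math


def doubling_time(r0):
    """Time for an exponentially growing quantity to double."""
    return math.log(2.0) / r0


def exponential_growth(q0, r0, t):
    """Unrestricted growth Q(t) = Q0 exp(R0 t)."""
    return q0 * math.exp(r0 * t)


def logistic_growth(q0, r0, k, t):
    """Restricted growth following Eq. 10.39b as written in the text."""
    return k / (1.0 + (k / q0 - 1.0) * math.exp(-r0 * t))


Q0, R0, K = 14.0, 0.024, 45.0  # TW, 1/year, TW (values from the example)
t2 = doubling_time(R0)
q_exp = exponential_growth(Q0, R0, 28.9)
q_log = logistic_growth(Q0, R0, K, 28.9)

print(f"doubling time = {t2:.1f} years")
print(f"exponential model at t = 28.9: {q_exp:.2f} TW")
print(f"logistic model at t = 28.9:    {q_log:.2f} TW")
```

Calling `logistic_growth` with a different carrying capacity gives the second logistic curve of the kind plotted in Fig. 10.19.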
Logistic models have numerous applications other than modeling restricted growth. Some of them are described below:

(a) in marketing and econometric applications for modeling the time rate at which new technologies penetrate the market place (for example, the saturation curves of new household appliances) or for modeling changes in consumer behavior (i.e., propensity to buy) when faced with certain incentives or penalties (see Pindyck and Rubinfeld 1981 for numerous examples);

(b) in neural network modeling (a form of non-linear black-box modeling discussed in Sect. 11.3.3) where logistic curves are used because of their asymptotic behavior at either end. This allows the variability in the regressors to be squashed or clamped within predefined limits;

(c) to model probability of occurrence of an event against one or several predictors which could be numerical or categorical. A medical application could entail a model for the probability of a heart attack for a population exposed to one or more risk factors. Another application is to model the spread of disease in epidemiological studies. A third is for dose response modeling meant to identify a non-linear relationship between the dose of a toxic agent to which a population is exposed and the response or risk of infection of the individuals stated as a probability or proportion. This is discussed further below;

(d) to model binary responses. For example, manufactured items may be defective or satisfactory; patients may respond positively or not to a new drug during clinical trials; a person exposed to a toxic agent could be infected or not. Logistic regression could be used to discriminate between the two groups or multiple groups (see Sect. 8.2.4 where classification methods are covered).

As stated in (c) above, logistic functions are widely used to model
how humans are affected when exposed to different toxic loads
(called dose-response) in terms of a proportion or a probability P.
For example, if a group of people is treated with a drug, not all of
them are likely to be responsive. If the experiments can be
performed at different dosage levels x, the percentage of responsive
people is likely to change. Here the response variable is called the
probability of "success" (which actually follows the binomial
distribution) and can assume values between 0 and 1, while the
regressor variable can assume any numerical value. Such variables
can be modeled by the following two-parameter model:

$$P(x) = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 x)]} \qquad (10.41)$$

where x is the dose to which the group is exposed and
($\beta_0$, $\beta_1$) are the two parameters to be estimated by MLE.
Note that this follows from Eq. 10.39a when k = 1.

One defines a new variable, called the odds ratio, as [P/(1 - P)].
Then, the log of this ratio is:

$$\pi = \ln\left(\frac{P}{1 - P}\right) \qquad (10.42)$$

where $\pi$ is called the logit function. Then, simple manipulation
of Eqs. 10.41 and 10.42 leads to a linear functional form for the
logit model:

$$\pi = \beta_0 + \beta_1 x \qquad (10.43)$$

Thus, rather than formulating a model for P as a non-linear function
of x, the approach is to model $\pi$ as a linear function of x. The
logit allows the number of failures and successes to be conveniently
determined. For example, $\pi = 3$ would imply that success is
$e^3 \approx 20$ times more likely than failure. Thus, a unit
increase in x will result in a change of $\beta_1$ in the logit
function.

The above model can be extended to multiple regressors. The general
linear logistic model allows the combined effect of several
variables or doses to be included:

$$\pi = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \qquad (10.44)$$

Note that the regression variables can be continuous, categorical or
mixed. The above models can be used to predict the dosages which
induce specific levels of responses. Of particular interest is the
dosage which produces a response in 50% of the population (median
dose). The following example illustrates these notions.

Example 10.4.4:^ Fitting a logistic model to the kill rate of the
fruit fly

A toxicity experiment was conducted to model the kill rate of the
common fruit fly when exposed to different levels of nicotine
concentration for a pre-specified time interval (recall that
concentration times duration of exposure equals the dose).
Table 10.16 assembles the experimental results.

The single-variate form of the logistic model (Eq. 10.41) is used to
estimate the two model parameters using MLE. The regressor variable
is the concentration while the response variable is the percent
killed or the proportion. An MLE analysis yields the results shown
in Table 10.17.

Estimating the dose or concentration which will result in P = 0.5,
i.e., a 50% fatality or success rate, is straightforward. From
Eq. 10.42, the logit value is
$\pi = \ln[0.5/(1 - 0.5)] = \ln(1) = 0$. Then, using Eq. 10.43, one
gets: $x_{50} = -(\beta_0/\beta_1) = 0.276$ g/100 cc. ■
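
The MLE step of this example can be reproduced with a few lines of code. The sketch below is only one possible way of doing it: it assumes the grouped binomial likelihood for the counts in Table 10.16 and maximizes it by Newton-Raphson iteration. The function name `fit_logistic` and the variable names are illustrative, and the text does not state which algorithm the statistical package actually used.

```python
import math

# Table 10.16: concentration (g/100 cc), insects exposed, insects killed
conc = [0.10, 0.15, 0.20, 0.30, 0.50, 0.70, 0.95]
exposed = [47, 53, 55, 52, 46, 54, 52]
killed = [8, 14, 24, 32, 38, 50, 50]


def fit_logistic(x, n, k, max_iter=50, tol=1e-10):
    """Newton-Raphson MLE of P(x) = 1 / (1 + exp(-(b0 + b1*x))) for grouped data."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ni, ki in zip(x, n, k):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = ki - ni * p              # observed minus expected kills
            w = ni * p * (1.0 - p)       # binomial variance weight
            g0 += r
            g1 += r * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det  # Newton step = inverse(information) * score
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    # Asymptotic standard errors from the inverse information matrix
    se0 = math.sqrt(h11 / det)
    se1 = math.sqrt(h00 / det)
    return b0, b1, se0, se1


b0, b1, se0, se1 = fit_logistic(conc, exposed, killed)
x50 = -b0 / b1  # median dose, from setting the logit (Eq. 10.43) to zero
print(f"b0 = {b0:.4f} (se {se0:.4f}), b1 = {b1:.4f} (se {se1:.4f}), x50 = {x50:.3f}")
```

The estimates and standard errors should agree closely with Table 10.17, and `x50` with the median dose quoted above.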



Table 10.16 Data table for Example 10.4.4

Concentration   Number of   Number   Percent
(g/100 cc)      insects     killed   killed
0.10            47           8       17.0
0.15            53          14       26.4
0.20            55          24       43.6
0.30            52          32       61.5
0.50            46          38       82.6
0.70            54          50       92.6
0.95            52          50       96.2

Table 10.17 Results of maximum likelihood estimation (MLE)

Parameter    Estimate   Standard error   Chi-square   p-value
$\beta_0$    -1.7361    0.2420           51.4482      < 0.0001
$\beta_1$     6.2954    0.7422           71.9399      < 0.0001



Example 10.4.5: Dose response modeling for sarin gas
Figure 10.20a shows the dose response curves for sarin gas for both
casualty dose (CD), i.e., that which induces an adverse reaction,
and lethal dose (LD), which causes death. The $CD_{50}$ and
$LD_{50}$ values, which will affect 50% of the population exposed,
are specifically indicated because of their importance as stated
earlier. Though various functions are equally plausible, the
parameter estimation will be done using the logistic curve.
Specifically, the LD curve is to be fitted, with the data read off
the plot and assembled in Table 10.18.

In this example, let us assume that all conditions for standard
multiple regression are met (such as equal error variance across the
range of the dependent variable), and so MLE and OLS will yield
identical results. In case they were not, and the statistical
package being used does not have the MLE capability, then the
weighted least squares method could be framed to yield maximum
likelihood estimates.

First, the dependent variable, i.e., the fraction of people
affected, is transformed into its logit equivalent given by
Eq. 10.43; the corresponding numerical values are given in the last
row of Table 10.18 (note that the entries at either end are left out
since the log of 0 is undefined). A second order model in the dose
level but linear in the parameters, of the form
$\pi = b_0 + b_1 x + b_2 x^2$, was found to be more appropriate than
the simple model given by Eq. 10.43. The model parameters were
statistically significant, and the model fits the observed data very
well (see Fig. 10.20b) with Adj $R^2$ = 94.3%. The numerical values
of the parameters and their 95% CL listed in Table 10.19 provide an
indication of how well the second order logit model fits the
data. ■
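
The second order fit described above can be sketched as an ordinary least squares problem on the logit values. The code below is a minimal illustration and not the procedure of any particular statistical package: it uses the nine interior points of Table 10.18 (doses 1,100 to 2,100, fatalities 10% to 90%), rescales the dose to keep the normal equations well conditioned, and solves them with a small hand-written Gaussian elimination (`solve3` is a helper introduced here). Because the table values were read off a plot, the coefficients should come out close to, but not necessarily identical with, those in Table 10.19.

```python
import math

# Interior points of Table 10.18 (logit undefined at 0% and 100%)
dose = [1100, 1210, 1300, 1400, 1500, 1600, 1700, 1880, 2100]
fatal_pct = [10, 20, 30, 40, 50, 60, 70, 80, 90]
logit = [math.log(f / (100.0 - f)) for f in fatal_pct]


def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


# Fit pi = c0 + c1*u + c2*u^2 with u = dose/1000, then convert back to dose units
u = [d / 1000.0 for d in dose]
S = [sum(ui ** k for ui in u) for k in range(5)]                    # sums of u^0 .. u^4
T = [sum(yi * ui ** k for ui, yi in zip(u, logit)) for k in range(3)]
A = [[S[0], S[1], S[2]],
     [S[1], S[2], S[3]],
     [S[2], S[3], S[4]]]
c0, c1, c2 = solve3(A, T)
b0, b1, b2 = c0, c1 / 1.0e3, c2 / 1.0e6

print(f"b0 = {b0:.4f}, b1 = {b1:.7f}, b2 = {b2:.4e}")
```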



^ From Walpole et al. (2007) by © permission of Pearson Education.

Fig. 10.20 a Dose-response curve for sarin gas (x-axis: Exposure
Dose, mg·min/m^3). (From Kowalski 2002 by permission of McGraw-Hill)
b Plot depicting the accuracy of the identified second order logit
model with observed data (observed versus predicted logit values)

Table 10.18 Data used for model building (for Example 10.4.5)

LD dose (x)      800        1,100   1,210   1,300   1,400   1,500  1,600  1,700  1,880  2,100  3,000
Fatalities (%)   0          10      20      30      40      50     60     70     80     90     100
Logit values     Undefined  -2.197  -1.386  -0.847  -0.405  0      0.405  0.847  1.386  2.197  Undefined

Table 10.19 Estimated model parameters of the second order logit
model

Parameter   Estimate          Standard error   95% CL lower limit   95% CL upper limit
b0          -10.6207          0.334209         -11.411              -9.83045
b1          0.0096394         0.00035302       0.00880463           0.0104742
b2          -0.00000169973    8.55464E-8       -0.00000190202       -0.00000149745



Generalized least squares (GLS) (not to be confused with the general
linear models described in Sect. 5.4.1) is another widely used
approach which is more flexible than OLS in the types of non-linear
models and error distributions it can accommodate. It can also
handle autocorrelated errors and non-constant variance. The
non-linearity can be overcome by transforming the response variable
into another intermediate variable using a link function, and then
identifying a linear model between the transformed variable and the
regressor variables. Thus, GLS allows ordinary linear regression to
be unified with other identification methods (such as logistic
regression) while requiring MLE for parameter estimation (discussed
in Sect. 10.4.3).



10.5 Non-linear Estimation 

Non-linear estimation applies to instances when the model is
non-linear in the parameters. This should not be confused with
non-linear models wherein the function may be non-linear but the
parameters may (or may not) appear in a linear fashion (an example
is a polynomial model). The parameter estimation can be done by
either least squares or by MLE of a suitably defined loss function.
Models non-linear in their parameters are of two types: (i) those
which can be made linear by a suitable variable transformation, and
(ii) those which are intrinsically non-linear. The latter is very
similar to optimizing a function (discussed in Chap. 7). A short
discussion of estimation under both instances is given below.



10.5.1 Models Transformable to Linear in the Parameters

Whenever appropriate, it is better to convert models non-linear in
the parameters to linear ones as the parameter estimation simplifies
considerably. The popularity of linear estimation methods stems from
the facts that the computational effort is low because of closed
form solutions, the approach is intuitively appealing, and there
exists a wide body of statistical knowledge supporting them.
However, the transformation results in sound parameter estimation
only when certain conditions regarding errors/noise are met, as
discussed below.

Table 10.20 gives a short list of useful transformations for
non-linear functions that result in simple linear models, while
Fig. 10.21 assembles plots of such functions. For example, an
exponential model is used in science, engineering, biology and
numerous other fields to characterize quantities (such as
population, radioactive decay, ...) which increase or decrease at a
rate that is directly proportional to their own magnitude. There are
different forms of the exponential model; the one shown is the most
general. For example, special cases of the function shown in
Table 10.20 are: $y = a e^{bx}$ or $y = 1 - a e^{bx}$. How the
numerical value of b affects the function is shown in Fig. 10.21a.

The power function (Fig. 10.21b) is widely used; notice the shape of
the curves depending on whether the model coefficient b is positive
or negative. Consider the following model:



Table 10.20 Some common transformations to make models non-linear in
the parameters into linear ones

Function type             Functional model                           Transformation          Transformed linear regression model
Exponential               y = exp(a + b1·x1 + b2·x2)                 y* = ln y               y* = a + b1·x1 + b2·x2
Power or multiplicative   y = a·x1^b·x2^c                            y* = log y, x* = log x  y* = log a + b·x1* + c·x2*
Logarithmic               y = a + b·log x1 + c·log x2                x* = log x              y = a + b·x1* + c·x2*
Reciprocal                y = (a + b1·x1 + b2·x2)^(-1)               y* = 1/y                y* = a + b1·x1 + b2·x2
Hyperbolic                y = x/(a + b·x)                            y* = 1/y, x* = 1/x      y* = b + a·x*
Saturation                y = a·x/(b + x)                            y* = 1/y, x* = 1/x      y* = 1/a + (b/a)·x*
Logit                     y = exp(a + b1·x1 + b2·x2) /               y* = ln[y/(1 - y)]      y* = a + b1·x1 + b2·x2
                              [1 + exp(a + b1·x1 + b2·x2)]

Fig. 10.21 Diagrams depicting different non-linear functions (with
slope b) which can be transformed to functions linear in the
parameters as shown in Table 10.20. a Exponential function, b Power
function, c Reciprocal function, d Hyperbolic function (from
Shannon 1975)



$$y = a \exp(bx) \qquad (10.45)$$

Taking logarithms results in the simple linear model whose
parameters are easily identified by OLS:

$$\ln y = \ln a + bx \qquad (10.46)$$

However, two important points need to be made. The first is that it
is the transformed variable which has to meet the OLS criteria
(listed in Sect. 5.5.1) and not the original variable. One of the
main implications is that the errors of the variable y are assumed
to be multiplicative. The transformation into ln(y) has (hopefully)
made the errors in the new model essentially additive, thereby
allowing OLS to be performed. If the analyst believes this to be
invalid, then there are two options: adopt a non-linear estimation
approach, or use weighted least squares (see Sect. 5.6.3 which
presents several alternatives). The assumption of multiplicative
errors in certain cases, such as exponential models, is usually a
good one since one would expect the magnitude of the errors to be
greater as the magnitude of the variable increases. However, this is
by no means obvious for other transformations. The second point is
that if OLS has been adopted, the statistical goodness-of-fit
indices, such as $R^2$ and RMSE of the regression model, as well as
any residual checks, apply to the transformed equation and not to
the original model. Overlooking this aspect can lead to misleading
inferences about how well the model explains the variation in the
original response variable.

Consider another example: the solution of a first-order linear
differential equation of a decay process,
$T(t) = T_0 \exp(-t/\tau)$, where $T_0$ and $\tau$ (interpreted as
the initial condition and the system time constant respectively) are
the model parameters. Taking logarithms on both sides results in
$\ln T(t) = \alpha + \beta t$ where $\alpha = \ln T_0$ and
$\beta = -1/\tau$. The model has, thus, become linear, and OLS can
be used to estimate $\alpha$ and $\beta$, and thence, $T_0$ and
$\tau$. The parameter estimates will, however, be biased; but this
will not cause any prediction bias when the model is used, i.e.,
T(t) will be accurately predicted. However, the magnitude of the
confidence and the prediction intervals provided by the OLS model
will be incorrect.

Natural extensions of the above single variate models to
multivariate ones are obvious. For example, consider the
multivariate power model:

$$y = b_0 \, x_1^{b_1} x_2^{b_2} \cdots x_p^{b_p} \qquad (10.47)$$

If one defines:

$$z = \ln y, \quad c = \ln b_0, \quad w_i = \ln x_i \quad \text{for } i = 1, 2, \ldots, p \qquad (10.48)$$

Then, one gets the linear model:

$$z = c + \sum_{i=1}^{p} b_i w_i \qquad (10.49)$$



Example 10.5.1: Data shown in Table 10.21 needs to be fit using the
simple power equation: $y = a x^b$.

(a) Taking natural logarithms results in: $\ln y = \ln a + b \ln x$,
which can be expressed as: $y^* = a' + b x^*$.
Subsequently, a simple OLS regression yields: a' = -0.702 and
b' = 1.737 with $R^2$ = 0.99 (excellent) and RMSE = 0.292. From
here, a = 0.498, and b = b' = 1.737.

Table 10.21 Data table for Example 10.5.1

x   1     1.5   2     2.5   3     3.5   4     4.5   5
y   0.5   1     1.7   2.2   3.4   4.7   5.7   6.2   8.4



The goodness of fit to the original power model is illustrated in
Fig. 10.22a.

(b) The scatter plot of the residuals versus x (Fig. 10.22b) reveals
some amount of improper residual behavior (non-constant variance) at
the high end, but a visual inspection is of limited value since it
does not allow statistical inferences to be made. Nonetheless, the
higher variability at the high end indicates that OLS assumptions
are not fully met. Finally, the residual analysis reveals that:

(i) the 8th observation has an unusual residual (studentized value
of -4.83), as is clearly noticeable;

(ii) the 8th and 9th observations are flagged as leverage points
with DFITS values of -2.99 and 1.78. This is not surprising since
these are at the high end, and with the lower value being close to
zero, their influence on the overall fit is bound to be great.

(c) What is the predicted value for y at x = 3.5? Also, determine
the 95% CL for the mean prediction and that for a specific value.

The answers are summarized in the table below.



y-value      Standard error   Lower 95.0% CL       Upper 95.0% CL       Lower 95.0% CL   Upper 95.0% CL
from model   for forecast     for specific value   for specific value   for mean         for mean
3.856        0.31852          3.103                4.609                3.556            4.156



The prediction of y for x = 3.5 is very poor. Instead of a y-value
close to 4.7, the predicted value is 3.86. Even the 95% CL does not
capture this value. This suggests that alternative models should be
evaluated. ■
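
Part (a) of the example, together with the earlier caution that goodness-of-fit indices apply to the transformed equation, can be illustrated with a short script. This is a sketch only: it performs a plain OLS fit of ln y on ln x for the data of Table 10.21 and then computes $R^2$ twice, once for the log-transformed model and once after back-transforming the predictions to the original units. All variable names are illustrative.

```python
import math

# Table 10.21
x = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
y = [0.5, 1, 1.7, 2.2, 3.4, 4.7, 5.7, 6.2, 8.4]

# Transform: y* = ln y, x* = ln x
lx = [math.log(v) for v in x]
ly = [math.log(v) for v in y]
n = len(x)

# Simple OLS of y* on x*
mx = sum(lx) / n
my = sum(ly) / n
sxx = sum((u - mx) ** 2 for u in lx)
sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
b = sxy / sxx                # exponent of the power model
a_prime = my - b * mx        # intercept of the transformed model, a' = ln a
a = math.exp(a_prime)        # back-transformed coefficient


def r_squared(obs, pred):
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


r2_log = r_squared(ly, [a_prime + b * u for u in lx])   # transformed equation
r2_orig = r_squared(y, [a * xi ** b for xi in x])        # original units

print(f"a' = {a_prime:.3f}, b = {b:.3f}, a = {a:.3f}")
print(f"R2 (log scale) = {r2_log:.4f}, R2 (original scale) = {r2_orig:.4f}")
```

The two $R^2$ values generally differ, which is why the index reported by OLS on the transformed data should not be quoted as a measure of fit for the original power model.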



10.5.2 Intrinsically Non-linear Models

There are numerous functions which are intrinsically non-linear. Two
examples of such models are:

$$y = b_0 + b_1 \exp(-b_2 x) \quad \text{and} \quad y = \exp(b_1 + b_2 x^2) \qquad (10.50)$$



A non-linear regression approach is the only recourse in such cases,
as well as in cases when a transformation resulting in a model
linear in the parameters still suffers from improper residual
behavior. Unlike linear parameter estimation which has closed form
matrix solutions, non-linear estimation requires iterative methods
which closely parallel the search techniques used in optimization
problems (see Sect. 7.3). Recall that the three major issues are:
the importance of specifying good initial or starting estimates, a
robust algorithm that suggests the proper search direction and step
size, and a valid stopping criterion. Hence, non-linear estimation
is prone to all the pitfalls faced by optimization problems
requiring search methods, such as local convergence, no solution
being found, and slow convergence. Either gradient-based (such as
steepest-descent, Gauss-Newton's method, first or second order
Taylor series, ...) or gradient-free methods are used. The former
are faster but sometimes may not converge to a solution, while the
latter are slower but more robust. Perhaps the most widely used
algorithm is the Levenberg-Marquardt algorithm which combines the
desirable features of both the linearized Taylor series and the
steepest-descent methods. Its attractiveness lies in the fact that
it converges reliably in most practical cases and does not slow down
as do most steepest-descent methods. Most of the statistical
software packages have the capability of estimating parameters of
non-linear models. However, it is advisable to use them with due
care; otherwise, the trial and error solution approach can lead to
the program being terminated abruptly (one of the causes being
ill-conditioning; see Sect. 10.2). Some authors suggest not to trust
the output of a non-linear solution until one has plotted the
measured values against the predicted ones and looked at the
residual plots. The interested reader can refer to several advanced
texts which deal with non-linear estimation (such as Bard 1974; Beck
and Arnold 1977; and Draper and Smith 1981).

Section 7.3.4, which dealt with numerical search methods, described
and illustrated the general penalty function approach for
constrained optimization problems. Such problems also apply to
non-linear parameter estimation where one has a reasonable idea
beforehand of the range of variation of the individual parameters
(say, based on physical considerations), and would like to constrain
the search space to this range. Such problems can be converted into
unconstrained multi-objective problems. The objective of minimizing
the squared errors S between measured and model-predicted values is
combined with another term which tries to maintain reasonable values
of the parameters p by adversely weighting the difference between
the search values of the parameters and their preferred values based
on prior knowledge. If the vector p denotes the set of parameters to
be estimated, then the loss or objective function is written as the
weighted square sum of the model residuals and those of the
parameter deviations:

$$J(\mathbf{p}) = \sum_{j=1}^{n} w \cdot S_j^2 + \sum_{i} (1 - w) \cdot (p_i - \bar{p}_i)^2 \qquad (10.51)$$

where $S_j$ is the model residual for observation j, w is the weight
(usually a fraction) associated with the model residuals, and
(1 - w) is the weight associated with the penalty of deviating from
the preferred values of the parameter set. Note that the preferred
vector $\bar{\mathbf{p}}$ is not necessarily the optimal solution
$\mathbf{p}^*$.

Fig. 10.22 Back transformed power model (Example 10.5.1) a Plot of
fitted model to original data, b Plot of residuals

A simplistic example will make the approach clearer. One wishes to
constrain the parameters a and b to be positive (for example,
certain physical quantities cannot assume negative values). An
arbitrary user-specified loss function could be of the sort:

$$J(\mathbf{p}) = \sum_{j=1}^{n} S_j^2 + 1{,}000 \cdot (a < 0) + 1{,}000 \cdot (b < 0) \qquad (10.52)$$

where the logical terms (a < 0) and (b < 0) equal 1 when true and 0
otherwise, and the multipliers of 1,000 are chosen simply to impose
a large penalty should either a or b assume negative values. It is
obvious that some care must be taken to assign such penalties
pertinent to the problem at hand, with the above example meant for
conceptual purposes only.
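
A loss function of the kind given by Eq. 10.52 is straightforward to code. The following is a conceptual sketch only; the function name `penalized_loss` and the default penalty of 1,000 are illustrative, and in practice this function would be handed to whatever search routine is being used.

```python
def penalized_loss(residuals, a, b, penalty=1000.0):
    """Sum of squared residuals plus a fixed penalty for each negative parameter."""
    sse = sum(r * r for r in residuals)
    # In Python, a comparison such as (a < 0) evaluates to True/False, which
    # behaves as 1/0 in arithmetic, mirroring the logical terms of Eq. 10.52.
    return sse + penalty * (a < 0) + penalty * (b < 0)


# Same residuals, with feasible and infeasible parameter values
print(penalized_loss([1.0, 2.0], a=0.5, b=0.2))    # no penalty
print(penalized_loss([1.0, 2.0], a=-0.5, b=0.2))   # one penalty term active
print(penalized_loss([1.0, 2.0], a=-0.5, b=-0.2))  # both penalty terms active
```

Because the penalty is a step function, the loss is discontinuous at a = 0 and b = 0, so a gradient-free search method is usually the safer choice with this formulation.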

Example 10.5.2^: Fit the following non-linear model to the data in
Table 10.22: $y = a(1 - e^{-bt}) + \varepsilon$.

First, it would be advisable to plot the data and look at the
general shape. Next, statistical software is used for the non-linear
estimation using least squares.

Table 10.22 Data table for Example 10.5.2

t   1      2      3      4      5      7      9      11
y   0.47   0.74   1.17   1.42   1.60   1.84   2.19   2.17

The equation of the fitted model is found to be:
y = 2.498*(1 - exp(-0.2024*t)) with adjusted $R^2$ = 98.9%, standard
error of estimate = 0.0661 and mean absolute error = 0.0484. The
overall model fit is thus deemed excellent. Further, the parameters
have low standard errors as can be seen from the parameter
estimation results shown in Table 10.23 along with the 95%
intervals.

Table 10.23 Results of the non-linear parameter estimation

Parameter   Estimate   Asymptotic       Asymptotic 95.0% confidence interval
                       standard error   Lower      Upper
a           2.498      0.1072           2.2357     2.7603
b           0.202      0.0180           0.1584     0.2464

How well the model predicts the observed data is illustrated in
Fig. 10.23a, while Fig. 10.23b is a plot of the model residuals.
Recall that studentized residuals measure how many standard
deviations each observed value of y deviates from a model fitted
using all of the data except that observation. In this case, there
is one studentized residual greater than 2 (point 7), but none
greater than 3. DFITS is a statistic which measures how much the
estimated coefficients would change if each observation was removed
from the data set. Two data points (points 7 and 8) were flagged as
having unusually large values of DFITS, and it would be advisable to
look at these data points more carefully, especially since these are
the two end points. Preferably, the model function may itself have
to be revised, and, if possible, collecting more data points at the
high end would result in a more robust and accurate model. ■
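
The iterative nature of non-linear least squares can be made concrete with a small Gauss-Newton routine applied to this example. The sketch below makes several assumptions that are not taken from the text: the time values are taken as t = 1, 2, 3, 4, 5, 7, 9, 11, the starting guess is chosen by eye (a near the largest observed y, b = 0.3), and a simple step-halving safeguard is added. It is not the algorithm of any particular statistical package.

```python
import math

# Data of Example 10.5.2 (time values assumed as stated in the lead-in)
t_data = [1, 2, 3, 4, 5, 7, 9, 11]
y_data = [0.47, 0.74, 1.17, 1.42, 1.60, 1.84, 2.19, 2.17]


def sse(a, b):
    """Sum of squared residuals for y = a * (1 - exp(-b * t))."""
    return sum((y - a * (1.0 - math.exp(-b * t))) ** 2
               for t, y in zip(t_data, y_data))


def gauss_newton(a, b, max_iter=100, tol=1e-10):
    for _ in range(max_iter):
        a11 = a12 = a22 = g1 = g2 = 0.0
        for t, y in zip(t_data, y_data):
            e = math.exp(-b * t)
            ja = 1.0 - e          # partial derivative of the model w.r.t. a
            jb = a * t * e        # partial derivative of the model w.r.t. b
            r = y - a * ja        # current residual
            a11 += ja * ja
            a12 += ja * jb
            a22 += jb * jb
            g1 += ja * r
            g2 += jb * r
        det = a11 * a22 - a12 * a12
        da = (a22 * g1 - a12 * g2) / det   # solve the 2x2 normal equations
        db = (a11 * g2 - a12 * g1) / det
        # Step halving: shrink the step if it would increase the loss
        step, current = 1.0, sse(a, b)
        while step > 1e-8 and sse(a + step * da, b + step * db) > current:
            step *= 0.5
        a += step * da
        b += step * db
        if max(abs(step * da), abs(step * db)) < tol:
            break
    return a, b


a_hat, b_hat = gauss_newton(a=max(y_data), b=0.3)
see = math.sqrt(sse(a_hat, b_hat) / (len(t_data) - 2))  # standard error of estimate
print(f"a = {a_hat:.3f}, b = {b_hat:.4f}, standard error of estimate = {see:.4f}")
```

Rerunning the routine from a few different starting guesses is a simple check on whether the search has settled at the same minimum each time.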



10.6 Computer Intensive Methods

10.6.1 Robust Regression

Proper estimation of parameters of a pre-specified model depends on
the assumptions one makes about the errors (see Sect. 10.4.1).
Robust regression methods are parameter estimation methods which are
not critically dependent on such assumptions (Chatfield 1995). The
term is also used to describe techniques by which the influence of
outlier points can be automatically down-weighted during parameter
estimation. Detection of gross outlier points in the data has been
addressed previously, and includes such measures as limit checks and
balance checks (Sect. 3.3.2), visual means (Sect. 3.3.3), and
statistical means (Sect. 3.6.6). Diagnostic methods using model
residual analysis (Sect. 5.6) are a more refined means of achieving
additional robustness since they allow detection of influential
points. There are also automated methods that allow robust
regression when faced with large data sets, and some of these are
described below.

Fig. 10.23 Back transformed non-linear model (Example 10.5.2) a Plot
of fitted model to original data, b Plot of residuals

^ From Draper and Smith (1981) by permission of John Wiley and

Recall that OLS assumes certain idealized conditions to hold, one of
which is that the errors are normally distributed. Often such
departures from normality are not serious enough to warrant any
corrective action. However, under certain cases, OLS regression
results are very sensitive to a few seemingly outlier data points,
or when response data spans several orders of magnitude. Under such
cases, the square of certain model residuals may overwhelm the
regression and lead to poor fits in other regions. Common types of
deficiencies include errors that may be symmetric but non-normal;
they may be more peaked than the normal with lighter tails, or the
converse. Even if the errors are normally distributed, certain
outliers may exist. One could identify outlier points, and repeat
the OLS fit by ignoring these. One could report the results of both
fits to document the effect or sensitivity of the fits to the
outlier points. However, the identification of outlier points is
arbitrary to some extent, and rather than rejecting points, methods
have been developed whereby less emphasis is placed during
regression on such dubious points. Such methods are called robust
fitting methods which assume some appropriate weighting or loss
function. Figure 10.24 shows several such functions. While the two
plots in the upper frame are continuous, those at the bottom are
discontinuous with one function basically ignoring points which lie
outside some pre-stipulated deviation value:
(a) Minimization of the least absolute deviations or MAD
(Fig. 10.24a). The parameter estimation is performed with the
objective function being to:

$$\text{minimize} \; \sum_{i} |y_i - \hat{y}_i| \qquad (10.53)$$

where $y_i$ and $\hat{y}_i$ are the measured and modeled response
variable values for observation i. This is probably the best known
robust method, but is said to be generally the least powerful in
terms of managing outliers.

(b) Lorentzian minimization adopts the following criterion:

$$\text{minimize} \; \sum_{i} \ln\left(1 + |y_i - \hat{y}_i|^2\right) \qquad (10.54)$$

This is said to be very effective with noisy data and data that
spans several orders of magnitude. It is similar to the normal curve
but with much wider tails; for example, even at 10 standard errors,
the Lorentzian contains 94.9% of the points. The Gaussian, on the
other hand, contains about the same percentage at 2 standard errors.
Thus, this function can accommodate instances of significant
deviations in the data.

(c) Pearson minimization proceeds to

$$\text{minimize} \; \sum_{i} \ln\left(\,\ldots\,\right) \qquad (10.55)$$

This is the most robust of the three methods with outliers having
almost no impact at all on the fitted line. This minimization should
be used in cases where wild and random errors are expected as a
natural course.

Fig. 10.24 Different weighting functions for robust regression (loss
versus deviation). a OLS (least-squares loss function) versus MAD
(least-absolute value loss function). b Two different outlier
weighting functions (less weight than the OLS loss function, and no
weight on outliers)

In summary, robust regression methods are those which are less
affected by outliers, and this is a seemingly advisable path to
follow. However, Draper and Smith (1981) caution against the blind
and indiscriminate use of robust regression since clear rules
indicating the most appropriate method to use for a presumed type of
error distribution do not exist. Rather, they recommend against the
use of robust regression based on any one of the above functions,
and suggest that maximum likelihood estimation be adopted instead.
In any case, when the origin, nature, magnitude and distribution of
errors are somewhat ambiguous, the cautious analyst should estimate
parameters by more than one method, study the results, and then make
a final decision with due diligence.

Example 10.6.1: Consider the simple linear regression data set given
in Example 5.3.1. One wishes to investigate the extent to which MAD
estimation would differ from the standard OLS method.

A commercial software program has been used to refit the same data
using the MAD optimization criterion which is more resistant to
outliers. The results are summarized below:

• OLS model identified in Example 5.3.1: y = 3.8296 + 0.9036*x
with the 95% CL for the intercept being {0.2131, 7.4461} and for the
slope {0.8011, 1.0061}.

• Using MAD analysis: y = 2.1579 + 0.9474*x.

One notes that the MAD parameters fall comfortably within the OLS
95% CL intervals, but a closer look at Fig. 10.25 reveals that there
is some deviation in the model lines, especially at the low range.
The difference is small, and one can conclude that the data set is
such that outliers have little effect on the model estimated. Such
analyses provide an additional level of confidence when estimating
model parameters using the OLS approach. ■



10.6.2 Bootstrap Sampling 

Recall that in Sect. 4.8.3, the use of the bootstrap method (one of
the most powerful and popular methods currently in use) was
illustrated for determining standard errors and confidence intervals
of a parametric statistical measure in a univariate context, and
also in a situation involving a nonparametric approach, where the
correlation coefficient between two variables was to be deduced.
Bootstrap is a statistical method where random resampling with
replacement is done repeatedly from an original or initial sample,
and then each bootstrapped sample is used to compute a statistic
(such as the mean, median or the inter-quartile range). The
resulting empirical distribution of the statistic is then examined
and interpreted as an approximation to the true sampling
distribution. Bootstrap is often used as a robust alternative to
inference-type problems when parametric assumptions are in doubt
(for example, knowledge of the probability distribution of the
errors), or where parametric inference is impossible or requires
very complicated formulas for the calculation of standard errors.
Note, however, that the bootstrap method cannot overcome some of the
limitations inherent in the original sample. The bootstrap samples
are to the sample what the sample is to the population. Hence, if
the sample does not adequately cover the spatial range or if the
sample is not truly random, then the bootstrap results will be
seriously inaccurate as well.

How the bootstrap approach can be used in a regression context is
quite simple in concept. The purpose of a regression analysis can be
either to develop a predictive model or to identify model
parameters. In such cases, one is interested in ascertaining the
uncertainty in either the model predictions or in determining the
confidence intervals of the estimated parameters. Standard or
classical techniques were described earlier to perform both tasks.
These tasks can also be performed by numerical methods. Paraphrasing
what was already stated in Sect. 4.8.1: "Efron and Tibshirani (1982)
have argued that given the available power of computing, one should
move away from the constraints of traditional parametric theory with
its over-reliance on a small set of standard models for which
theoretical solutions are available, and substitute computational
power for theoretical analysis. This parallels the manner in which
numerical methods have, in large part, augmented/replaced closed
form solution techniques in almost all fields of engineering and
science."

Say, one has a data set of multivariate observations: z = 
{y., Xj., x^. } withi=l, ...n (this can be viewed as a sample 
with n observations in the bootstrap context taken from a 
population of possible observations). One distinguishes bet- 
ween two approaches: 

(a) case resampling, where the predictors and response observations
    i are random and change from sample to sample. One selects a
    certain number of bootstrap subsamples (say 1,000) from z, fits
    the model and saves the model coefficients from each bootstrap
    sample. The generation of the confidence intervals for the
    regression coefficients is now similar to the univariate
    situation, and is quite straightforward. One of the benefits is
    that the correlation structure between the regressors is
    maintained;
(b) model-based resampling or fixed-X resampling, where the regressor
    data structure is already imposed or known with confidence. Here,
    the basic idea is to generate or resample the model residuals and
    not the observations themselves. This preserves the stochastic
    nature of the model structure, and so the standard errors are
    more representative of the model's own assumptions. The
    implementation involves attaching a random error to each fitted
    value $\hat{y}_i$, thereby producing a fixed-X bootstrap sample.
    The errors could be generated: (i) parametrically from a normal
    distribution with zero mean and variance equal to the estimated
    error variance in the regression if normal errors can be assumed
    (this is analogous to the concept behind the Monte Carlo
    approach), or (ii) non-parametrically, by resampling residuals
    from the original regression. One would then regress the
    bootstrapped values of the response variable on the fixed X
    matrix to obtain bootstrap replications of the regression
    coefficients. This approach is often adopted with designed
    experiments.

The reader can refer to Efron and Tibshirani (1982), Davison and
Hinkley (1997) and other more advanced papers such as Freedman and
Peters (1984) for a more complete treatment.

Fig. 10.25 Comparison of OLS model (shown solid) along with 95%
confidence and prediction bands and the MAD model (Example 10.6.1)
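The two resampling schemes can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch under stated assumptions, not the procedure of any particular reference: the function names (`ols_fit`, `bootstrap_cases`, `bootstrap_residuals`, `percentile_ci`), the synthetic straight-line data and the choice of simple percentile intervals are all hypothetical choices made for the illustration.

```python
import numpy as np

def ols_fit(X, y):
    """OLS coefficients for y = X @ beta (X already holds the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def bootstrap_cases(X, y, n_boot=1000, seed=None):
    """Case resampling: draw whole observations (rows) with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    betas = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # row indices, with replacement
        betas[b] = ols_fit(X[idx], y[idx])
    return betas

def bootstrap_residuals(X, y, n_boot=1000, seed=None):
    """Fixed-X resampling: keep X, add resampled residuals to the fitted values."""
    rng = np.random.default_rng(seed)
    fitted = X @ ols_fit(X, y)
    resid = y - fitted
    betas = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        y_star = fitted + rng.choice(resid, size=len(y), replace=True)
        betas[b] = ols_fit(X, y_star)
    return betas

def percentile_ci(betas, level=0.95):
    """Percentile confidence limits (row 0: lower, row 1: upper) per coefficient."""
    tail = 100.0 * (1.0 - level) / 2.0
    return np.percentile(betas, [tail, 100.0 - tail], axis=0)

# Synthetic data (assumed for illustration): y = 2 + 0.5 x + noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 40)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, size=x.size)

beta_hat = ols_fit(X, y)
ci_cases = percentile_ci(bootstrap_cases(X, y, seed=1))
ci_resid = percentile_ci(bootstrap_residuals(X, y, seed=2))
print("OLS estimate:", beta_hat)
print("95% CI, case resampling:\n", ci_cases)
print("95% CI, residual resampling:\n", ci_resid)
```

Replacing `rng.choice(resid, ...)` by draws from a zero-mean normal distribution with the estimated error variance would give the parametric variant (i) of the fixed-X approach.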



Problems 

Pr. 10.1 Compute and interpret the condition numbers for
the following:

(a) $f(x) = e^{-x}$ for $x = 10$

(b) $f(x) = [(x^2 + 1)^{1/2} - x]$ for $x = 1000$



Pr. 10.2 Compute the condition number of the following
matrix

$$\begin{bmatrix} 21 & 7 & -1 \\ 5 & 7 & 7 \\ 4 & -4 & 20 \end{bmatrix}$$



Pr. 10.3 Chiller data analysis using PCA
Consider Table 10.4 of Example 10.3.3 which consists of a data set
of 15 possible characteristic features (CFs) or variables under 27
different operating conditions of a centrifugal chiller (Reddy
2007). Instead of retaining all 15 variables, you will reduce the
data set first by generating the correlation matrix of this data set
and identifying pairs of variables which exhibit (i) the most
correlation and (ii) the least correlation. It is enough if you
retain only the top 5-6 variables. Subsequently repeat the PCA
analysis as shown in Example 10.3.3 and compare results.

Pr. 10.4 Quality control of electronic equipment involves taking a
random sample of size n and determining the proportion of items
which are defective. Compute and graph the likelihood function for
the two following cases: (a) n = 6 with 2 defectives, (b) n = 8 with
3 defectives.

Pr. 10.5 Consider the data of Example 10.4.2 for which the two
parameters of a Weibull distribution were estimated based on MLE.
Compare the goodness of this fit (based on the Chi-square statistic)
to that using a logistic model.

Pr. 10.6 Indoor air quality measurements of carbon dioxide
concentration reveal how well the building is ventilated, i.e.,
whether adequate ventilation air is being brought in and properly
distributed to meet the comfort needs of the occupants dispersed
throughout the building. The following twelve measurements of CO2 in
parts per million (ppm) were taken in the twelve rooms of a building:

{732, 816, 875, 932, 994, 1003, 1050, 1113,
1163, 1208, 1292, 1382}

(a) Assuming normal distribution, estimate the true average 
concentration and the standard deviation using MLE, 

(b) How are these different from classical MME values? 
Discuss. 

Pr. 10.7 Non-linear model fitting to thermodynamic properties of
steam

Table 10.24 lists the saturation pressure in kPa for different
values of temperature extracted from the well-known steam tables.

Two different models proposed in the literature are: 
Pv,sai = c. exp ( j^) and \np^,^,ai = a + f where T is the 
temperature in units Kelvin. 

(a) You are asked to estimate the model parameters by both OLS
    (using variable transformation to make the estimation linear)
    and by MLE along with standard errors of the coefficients.
    Comment on the differences of both methods and the implied
    assumption,

(b) Which of the two models is the preferred choice? Give reasons,

(c) You are asked to use the identified models to predict saturation
    pressure at t = 75°C along with the model prediction error.
    Comment.

Table 10.24 Data table for Problem 10.7

Temperature t (°C)         10     20     30     40     50      60     70     80     90
Pressure p_v,sat (kPa)     1.227  2.337  4.241  7.375  12.335  19.92  31.16  47.36  70.11

Table 10.25 Data table for Problem 10.8

Type of       Total number   Annual energy use (kWh)     Initial number    Initial growth   Predicted
equipment     of units       Current model  New model    of new models N0  rate r (%)       saturation fraction
Central A/C   10,000         3,500          2,800        100               5                0.40
Color TV      15,000         800            600          150               8                0.60
Lights        40,000         1,000          300          500               20               0.80

Pr. 10.8 Logistic functions to study residential equipment
penetration

Electric utilities provide incentives to homeowners to replace
appliances by high-efficiency ones, such as lights, air conditioners,
dryers/washers, etc. In order to plan for future load growth, the
annual penetration levels of such equipment need to be estimated
with some accuracy. Logistic growth models have been found to be
appropriate since market penetration rates reach a saturation level,
often specified as a saturation fraction or the fractional number of
the total who purchase this equipment. Table 10.25 gives a
fictitious example of load estimation with three different types of
residential equipment. If the start year is 2010, plot the
year-to-year energy use for the next 20 years assuming a logistic
growth model for each of the three pieces of equipment separately,
and also for the combined effect.

Hint: The carrying capacity can be calculated from the predicted
saturation fraction and the difference in annual energy use between
the current model and the new one.

Pr. 10.9 Fitting logistic models for growth of population and
energy use

Table 10.26 contains historic population data from 1970 till 2010
(current) as well as extrapolations till 2050 (from the U.S. Census
Bureau's International Data Base). Primary energy use in million
tons of oil equivalent (MTOE) consumed annually has also been
gathered for the same years (the last value for 2010 was partially
extrapolated from 2008). You are asked to analyze this data using
logistic models:

(a) Plot the population data,

(b) Estimate growth rate and carrying capacity by fitting a logistic
    model to the data from 1970-2010. Does your analysis support the
    often quoted estimate that the world population would plateau at
    9 billion?

(c) Predict population values for the next four decades along with
    95% uncertainty estimates, and compare them with the projections
    by the US Census Bureau,

(d) Use a logistic model to fit the energy use data and estimate the
    growth rate and the carrying capacity. Calculate the per capita
    annual energy use for each of the five decades from 1970 to
    2010. Analyze results and draw pertinent conclusions.

Table 10.26 Data table for Problem 10.9

Year    World population (in billion)    Primary energy use in MTOE
1970    3.712                            4970.2
1980    4.453                            6629.7
1990    5.284                            8094.7
2000    6.084                            9262.6
2010    6.831                            1,150 (a)
2020    7.558 (projected)                -
2030    8.202                            -
2040    8.748                            -
2050    9.202                            -

(a) Extrapolated from measured 2008 value

Pr. 10.10 Dose response model fitting for VX gas
You will repeat the analysis illustrated in Example 10.4.5 using the
dose response curves for VX gas, which is a nerve agent. You will
identify the logistic model parameters for both the casualty dose
(CD) and lethal dose (LD) curves and report relevant model fit and
parameter statistics (Fig. 10.26).

Pr. 10.11 Non-linear parameter estimation of a model between volume
and pressure of a gas

The pressure P of a gas corresponding to various volumes V is given
in Table 10.27.



[Plot: VX gas dose response curves with the CD50 point marked;
horizontal axis: Exposure Dose, mg-min/m^3; vertical axis: 0% to 100%]

Fig. 10.26 Dose response curves for VX gas with 50% casualty dose
(CD50) and 50% lethal dose (LD50) points. (From Kowalski 2002 by
permission of McGraw-Hill)

Table 10.27 Data table for Problem 10.11

V (cm^3)      50     60     70     90     100
P (kg/cm^2)   64.7   51.3   40.5   25.9   7.8


Table 10.28 Data table for Problem 10.12

x (cm)   30     35     40     45     50     55     60     65     70     75
y        0.85   0.67   0.52   0.42   0.34   0.28   0.24   0.21   0.18   0.15



(a) Estimate the coefficients a and b assuming the ideal gas
law: $P V^a = b$,

Hint: Take natural logs and re-arrange the model in a form
suitable for regression,

(b) Study model residuals and draw relevant conclusions. 

Pr. 10.12 Non-linear parameter estimation of a model between light
intensity and distance

An experiment was conducted to verify the intensity of light (y) as
a function of distance (x) from a light source; the results are
shown in Table 10.28.

(a) Plot this data and fit a suitable polynomial model, 

(b) Fit the data with an exponential model, 

(c) Fit the data using a model derived from the underlying 
physics of the problem, 

(d) Compare the results of all three models in terms of their 
model statistics as well as their residual behavior. 

Pr. 10.13 Consider a hot water storage tank which is heated
electrically. A heat balance on the storage tank yields:

$$P(t) = C\,\dot{T}_s(t) + L\,[T_s(t) - T_a(t)] \qquad (10.56)$$

where
C = M c_p = thermal capacity of the storage tank [J/°C]
L = U A = heat loss coefficient of the storage tank [W/°C]
    (A = surface area of storage)
T_s(t) = temperature of storage [°C]
T_a(t) = ambient temperature [°C]
P(t) = heating power [W] as function of time t

Suppose one has data for T_s(t) during cool-down under constant
conditions: P(t) = 0 and T_a(t) = constant. In that case, the energy
balance can be written as (where the constant has been absorbed in
T, i.e. T is now the difference between the storage temperature and
the ambient temperature)

$$\tau\,\dot{T}(t) + T(t) = 0 \qquad (10.57)$$

with $\tau = C/L$ = time constant.

Table 10.29 assembles the test results during storage tank
cool-down.

(a) First, linearize the model and estimate its parameters,

(b) Study model residuals and draw relevant conclusions,

(c) Determine the time constant of the storage tank along with
    standard errors.

Table 10.29 Data table for Problem 10.13

t (h)   0      1   2     3     4     5     6   7     8   9     10    11
T(t)    10.1   8   6.8   5.7   4.4   3.8   3   2.4   2   1.8   1.1   1

Pr. 10.14 Model identification for wind chill factor 
The National Weather Service generates tables of wind chill 
factor (WC) for different values of ambient temperature (T) 
in °F and wind speed (V) in mph. The WC is an equivalent 
temperature which has the same effect on the rate of heat loss 
as that of still air (an apparent wind of 4 mph). The equation 
used to generate the data in Table 10.30 is: 

WC = log V{Q.26T) - 23.68(0.637 + 32.9) (10.58) 



(a) You are given this data without knowing the model. Examine a
    linear model between WC = f(T,V), and point out inadequacies in
    the model by looking at the model fits and the residuals,

(b) Investigate, keeping T fixed, a possible relation between WC
    and V,

(c) Repeat with WC and T,

(d) Adopt a stage-wise model building approach and evaluate
    suitability,

(e) Fit a model of the type: $WC = a + b \cdot T + c \cdot V + d \cdot V^{1/2}$
    and evaluate suitability.



Table 10.30 Data table for Problem 10.14. (From Chatterjee and
Price 1991 by permission of John Wiley and Sons)

Wind speed   Actual air temperature (°F)
(mph)        50    40    30    20    10     0   -10   -20   -30   -40   -50   -60
 5           48    36    27    17     5    -5   -15   -25   -35   -46   -56   -66
10           40    29    18     5    -8   -20   -30   -43   -55   -68   -80   -93
15           35    23    10    -5   -18   -29   -42   -55   -70   -83   -97  -112
20           32    18     4   -10   -23   -34   -50   -64   -79   -94  -108  -121
25           30    15    -1   -15   -28   -38   -55   -72   -88  -105  -118  -130
30           28    13    -5   -18   -33   -44   -60   -76   -92  -109  -124  -134
35           27    11    -6   -20   -35   -48   -65   -80   -96  -113  -130  -137
40           26    10    -7   -21   -37   -52   -68   -83  -100  -117  -135  -140
45           25     9    -8   -22   -39   -54   -70   -86  -103  -120  -139  -143
50           25     8    -9   -23   -40   -55   -72   -88  -105  -123  -142  -145

Table 10.31 Data table for Problem 10.15

Before weather-stripping      After weather-stripping
Δp (Pa)    Q (m^3/h)          Δp (Pa)    Q (m^3/h)
 3.0       365.0               2.2        99.2
 5.0       445.9               5.5       170.4
 5.8       492.7               6.7       185.6
 6.7       601.8               8.2       208.5
 8.2       699.2              11.6       263.2
 9.0       757.5              13.5       283.1
10.0       812.4              15.6       310.2
11.0       854.1              18.2       346.2

Pr. 10.15 Parameter estimation for an air infiltration model in
homes

The most common manner of measuring exfiltration (or infiltration)
rates in residences is by artificially pressurizing (or
depressurizing) the home using a device called a blower door. This
device consists of a door-insert with a rubber edge which can
provide an air-tight seal against the door jamb of one of the doors
(usually the main entrance). The blower door has a variable speed
fan, an air flow measuring meter and a pressure difference manometer
with two plastic hoses (to allow the inside and outside pressure
differential to be measured). All doors and windows are closed
during the testing. The fan speed is increased incrementally and the
pressure difference Δp and air flow rate Q are measured at each
step. The model used to correlate air flow with pressure difference
is a modified orifice flow model given by $Q = k(\Delta p)^n$ where
k is called the flow coefficient (which is proportional to the
effective leakage area of the building envelope) and n is the flow
exponent. The latter is close to 0.5 when the flow is strictly
turbulent which occurs when the flow paths through the interstices
of the envelope are small and tortuous (like in a well-built "tight"
house), while n is close to 1.0 when the flow is laminar (such as in
a "loose" house). Values of n around 0.65 have been experimentally
determined for typical residential construction in the U.S.

Table 10.31 assembles test results of an actual house where blower
door tests were performed both before and after weather-stripping.
Identify the two sets of coefficients k and n for tests done before
and after the house tightening. Based on the uncertainty estimates
of these coefficients, what can you conclude about the effect of
weather-stripping the house?


A broad description of inverse methods (introduced in Sect. 1.3.3)
is that they pertain to the case when the system under study already
exists, and one uses measured or observed system behavior to aid in
the model building. The three types of inverse problems were
classified as: (i) calibration of white-box models (which can be
either relatively simple or complex-coupled simulation models) which
requires that one selectively manipulate certain parameters of the
model to fit observed data. Monte Carlo methods, which are powerful
random sampling techniques, are described along with regional
sensitivity analyses as a means of reducing model order of complex
simulation models; (ii) model selection and parameter estimation
involving positing either black-box or grey-box models, and using
regression methods to identify model parameters based on some
criterion of error minimization (the least squares regression method
and the maximum likelihood method being the most popular); and (iii)
control problems where input states and/or boundary conditions are
inferred from knowledge of output states and model parameters. This
chapter elaborates on these methods and illustrates their approach
with case study examples. Local polynomial regression methods are
also briefly described as well as the multi-layer perceptron
approach, which is a type of neural network modeling method that is
widely used in modeling non-linear or complex phenomena. Further,
the selection of a grey-box model based more on policy decisions
rather than on how well a model fits the data is illustrated in the
framework of dose-response models. Finally, state variable model
formulation and compartmental modeling appropriate for describing
dynamic behavior of linear systems are introduced along with a
discussion of certain identifiability issues in practice.



11.1 Inverse Problems Revisited

Inverse problems were previously introduced and classified in
Sect. 1.3.3. They are classes of problems pertaining to the case
when the system under study already exists, and one uses measured or
observed system behavior to aid in the model building. The three
types of inverse problems were also briefly described, namely,
calibration of white box models, model selection and parameter
estimation (or system identification), and control problems (see
Fig. 1.12). The inverse approach makes use of the additional
information not available to the forward approach, viz. measured
system performance data, to either tune the parameters of an
existing model or identify a macroscopic model that captures the
major physical interactions of the system and whose parametric
values are determined by regression to the data. This "fine tuning"
makes the inverse approach more suitable for diagnostics and
analysis of system properties, provides better conceptual insights
into system behavior, and allows more accurate predictions and
optimal control. The forward approach, on the other hand, is more
appropriate for evaluating different alternatives during the design
phase and for sensitivity studies. The inverse approach has been
used in a wide variety of areas such as engineering, industrial
process control, aeronautics, signal processing, biomedicine,
ecology, transportation, and robotics. This chapter elaborates on
these methods, and illustrates their approach with case study
examples.



11.2 Calibration of White Box Models

11.2.1 Basic Notions 

Calibration problems generally involve mechanistic white-box models
that have a well defined model structure, i.e., the set of modeling
equations can be solved uniquely. They can be of two types. One
situation is when a detailed simulation program (developed for
forward applications) is used to provide the model structure, with
the numerous model parameters needing to be tuned so that simulated
output closely matches observed system performance. Most often, such
models have so many parameters that the measured data does not allow
them to be identified uniquely. Such a problem is referred to as
over-parameterized (or under-determined): it has no unique solution
since one is trying to identify the numerous model parameters with
too little "information" about the behavior of the system (i.e.,
relatively too few data points and/or limited in "richness"). The
order of the model has to be reduced mathematically by freezing
certain of these parameters at their "best-guess" values. Which
parameters to freeze, how many parameters one can hope to identify,
and what the uncertainty of such calibrated models will be are
important aspects which need to be considered. The second type of
situation is when the analyst only has a few data points, and adopts
the approach of developing a simple mechanistic model which directly
provides an analytical relationship between the simulation outputs
and inputs consistent with the data at hand, i.e., such that the
model parameters can be identified uniquely. Hence, both situations
are characterized by relatively little data "richness", and
parameters are identified by solving the set of equations or by
search methods requiring no regression.

A trivial example of such an over-parameterized problem is one where
the three model coefficients $\{a_1, a_2, a_3\}$ are to be
determined when only two observations of $\{\Delta p, V\}$ are
available:

$$\Delta p = a_1 + a_2 \cdot V + a_3 \cdot V^2$$

There are two observations and three parameters, and clearly this
cannot be solved unless the model is simplified or reformulated so
as to have two parameters only. Alternatively, one could collect
more data and circumvent this limitation.
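The lack of a unique solution can be checked numerically. The short NumPy sketch below assumes the quadratic form Δp = a1 + a2·V + a3·V² and uses two made-up observations (the numerical values of `V` and `dp` are purely illustrative, not data from the text). It shows that the design matrix has rank 2, that two different coefficient vectors reproduce the data exactly, and that freezing a3 = 0 makes the remaining two parameters uniquely identifiable.

```python
import numpy as np

# Two hypothetical observations of (V, dp); the numbers are illustrative only
V = np.array([1.0, 2.0])
dp = np.array([3.0, 7.0])

# Design matrix for dp = a1 + a2*V + a3*V**2 (three unknowns, two equations)
A = np.column_stack([np.ones_like(V), V, V**2])
rank = np.linalg.matrix_rank(A)           # rank 2 < 3 unknowns

# lstsq returns the minimum-norm solution among infinitely many exact fits
a_min_norm, *_ = np.linalg.lstsq(A, dp, rcond=None)

# Any multiple of the null-space direction can be added without changing the fit
_, _, Vt = np.linalg.svd(A)
null_dir = Vt[-1]
a_other = a_min_norm + 5.0 * null_dir

print("rank of design matrix:", rank)
print("fit 1:", a_min_norm, "->", A @ a_min_norm)
print("fit 2:", a_other, "->", A @ a_other)

# Reducing the model order: freeze a3 = 0 and solve the 2x2 system uniquely
a_reduced = np.linalg.solve(A[:, :2], dp)
print("reduced model (a1, a2):", a_reduced)
```

Which parameter to freeze, and at what value, is a modeling decision; the code only shows that some such reduction (or more data) is needed before the coefficients carry any meaning.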



11.2.2 Example of Calibrated Model Development: Global Temperature
Model

The atmosphere around the earth acts as a greenhouse, selectively
letting in short-wave solar radiation and absorbing (and thus
trapping) the long wave infrared back-radiation from the earth's
surface. This warming is beneficial to life on earth since the
earth's average surface temperature has been increased from about
-18°C (-0.4°F) to the current 14°C (57°F). However, the recent
increase in global warming of the earth's lower atmosphere as a
result of human activity (called anthropogenic global warming, AGW),
and its associated dire consequences, has resonated with society at
large, and has spurred scientists and policymakers to find ways to
mitigate it. Very complex mechanistic circulation models have been
developed, and subsequently calibrated against actual measurements.
Such calibrated models are meant to make extrapolations over time so
as to study the impact of different AGW mitigation strategies.
However, since these mitigation measures have financial and
trans-continental societal implications, coupling them to models
which capture such considerations would result in them soon becoming
too complex for clear interpretation and policy-making. It is in
such cases that inverse models can be useful, and they have been the
object of numerous studies in the published literature. A rather
simplistic inverse model of the earth's radiation budget is
presented below, meant more to illustrate the concept of developing
a calibrated white-box model than for any practical usefulness.

Consider Fig. 11.1 taken from IPCC (1996) which shows the various
thermal fluxes and their numerical values (determined by direct
measurement and from other complex models). Let us assume that these
quantities are known to us (similar to taking measurements of an
actual existing system), and can be used to develop an appropriate
model. Clearly, the complexity of the model is constrained by the
available measurements. The atmosphere is modeled as a single node
with different upward and downward emissivities (so as to partly
account for the AGW effect) interacting with the surface of the
earth, also represented by a single node. The various heat flows to
be considered in the model are shown in Fig. 11.2.

Fig. 11.1 Average energy flows between the Earth's surface, the
atmosphere and space under global equilibrium. (Taken from IPCC
1996)

Fig. 11.2 Sketch of simplified global temperature model. The
atmosphere and the earth's surface have been modeled as single nodes

Let $T_{atm}$ be the aggregated effective atmospheric temperature
(in Kelvin), $T_{sur}$ the average effective surface temperature of
the earth (in Kelvin), R the mean radius of the earth and $S_{ext}$
the extraterrestrial incoming solar flux at the mean sun-earth
distance, called the solar constant (= 1,367 W/m²). The temperature
of deep space is taken to be 0 K. The thickness of the atmosphere is
so small compared to R that one can safely neglect it. Then, the
average solar power per unit area on top of the atmospheric layer
is:

$$S = \frac{S_{ext}\,\pi R^2}{4 \pi R^2} = \frac{1367}{4} = 342 \ \text{W/m}^2 \qquad (11.1)$$

This is the average value distributed over day and night and over
all latitudes (shown in Fig. 11.1 as the incoming solar radiation).
The various heat flows are modeled by adopting average aggregate
radiative properties such that (refer to Fig. 11.2):

(i) the incoming short wave solar radiation undergoes absorption in
    the atmosphere (absorptivity $\alpha_{atm,sw}$), transmission
    (transmittivity $\tau_{atm,sw}$) and reflection back to deep
    space;
(ii) most of the transmitted solar radiation is absorbed by the
    earth's surface (absorptivity $\alpha_{sur,sw}$) and the rest is
    reflected back to deep space (multiple reflection effect is
    overlooked);
(iii) the earth is assumed to be a black body for long wave
    radiation (absorptivity $\alpha_{sur,lw}$ = emissivity
    $\varepsilon_{sur,lw}$ = 1);
(iv) thermo-evaporation accounts for a certain amount of heat
    transfer from the earth to the atmosphere (= 24 + 78 W/m²), and
    a simplified model is

    $$q_{th-e} = h_{c-e}\,(T_{sur} - T_{atm}) \qquad (11.2)$$

    where $h_{c-e}$ is an effective combined coefficient;
(v) the atmosphere transmits some of the long-wave surface radiation
    from the earth to deep space through the atmospheric window
    (= 40 W/m²) and absorbs most of it (long-wave absorptivity
    $\alpha_{atm,lw}$) with reflectivity being negligible;
(vi) the atmosphere loses some of the heat flux (= 165 + 30 W/m²) by
    long wave radiation upwards to deep space (emissivity
    $\varepsilon_{atm,up}$) and the rest downwards to earth as back
    radiation (emissivity $\varepsilon_{atm,down}$).
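Heat balance on the atmosphere:

$$S\,\alpha_{atm,sw} + h_{c-e}\,(T_{sur} - T_{atm}) + \alpha_{atm,lw}\,\varepsilon_{sur,lw}\,\sigma T_{sur}^4 = 2\,(\varepsilon_{atm,up} + \varepsilon_{atm,down})\,\sigma T_{atm}^4 \qquad (11.3)$$

where the multiplier of 2 is introduced since heat losses from the
atmosphere occur both upwards and downwards.

Heat balance on the earth's surface:

$$(S\,\tau_{atm,sw})\,\alpha_{sur,sw} + \alpha_{sur,lw}\,(2\,\varepsilon_{atm,down}\,\sigma T_{atm}^4) = h_{c-e}\,(T_{sur} - T_{atm}) + \varepsilon_{sur,lw}\,\sigma T_{sur}^4 \qquad (11.4)$$

where σ is the Stefan-Boltzmann constant = 5.67 × 10^-8 W/m² K^4.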

A forward simulation approach would involve solving the above
equations using pre-calculated or best-guess values of the numerous
coefficients appearing in these equations without any consideration
to actual measured fluxes and temperatures. The same model becomes
an inverse calibrated model if model parameters can be estimated or
tuned from actual measurements, and the value of the variables (in
this case, the earth surface temperature) subsequently predicted.
Comparing the predicted and measured values provides a means of
evaluating the model. In order to avoid the issue of
over-parameterization, the model has been deliberately selected so
as to be well specified with zero degrees of freedom.

Numerical values of most of these coefficients can be determined
from the quantities shown in Fig. 11.1.

Average solar flux absorbed by the atmosphere
= 67 = $S \cdot \alpha_{atm,sw}$, from which
$\alpha_{atm,sw} = 67/342 = 0.196$.

Average solar flux transmitted through the atmosphere
= (342 - 77 - 67) = $S \cdot \tau_{atm,sw}$, from which
$\tau_{atm,sw} = 198/342 = 0.579$.

Average solar flux absorbed by the earth's surface
= 168 = $(S \cdot \tau_{atm,sw}) \cdot \alpha_{sur,sw}$, from which
$\alpha_{sur,sw} = 168/198 = 0.848$.

Average long wave radiation flux from earth absorbed by the
atmosphere = 350 = $q_{sur,lw} \cdot \alpha_{atm,lw}$, from which
$\alpha_{atm,lw} = 350/390 = 0.897$.

Average long wave radiation fluxes to deep space and towards the
earth are:

$$2\,\varepsilon_{atm,up}\,\sigma T_{atm}^4 = 165 + 30, \qquad 2\,\varepsilon_{atm,down}\,\sigma T_{atm}^4 = 324 \qquad (11.5)$$

with: $\varepsilon_{atm,up} + \varepsilon_{atm,down} = 1$

Using parameter values already deduced above, Eqs. 11.3-11.5 can be
solved resulting in $\varepsilon_{atm,up} = 0.376$ and
$\varepsilon_{atm,down} = 0.624$, while $T_{atm}$ = 260.1 K or
-12.9°C and $T_{sur}$ = 288 K or 15°C, which is within 1°C of the
measured mean surface temperature of the earth. Values of all the
parameters of the model are assembled in Table 11.1 for easy
reference.

Table 11.1 Description of quantities and their numerical values
appearing in Sect. 11.2.2

Quantity                                                      Symbol          Numerical value
Average incoming solar flux                                   S               342 W/m²
Absorptivity of atmosphere to solar radiation                 α_atm,sw        0.196
Transmittivity of atmosphere to solar radiation               τ_atm,sw        0.579
Absorptivity of earth's surface to solar radiation            α_sur,sw        0.848
Absorptivity of atmosphere to long wave radiation             α_atm,lw        0.897
Emissivity of earth's surface to long wave radiation          ε_sur,lw        1.0
Emissivity of atmosphere for radiation to deep space          ε_atm,up        0.376
Emissivity of atmosphere for radiation downwards to earth     ε_atm,down      0.624
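The calibration arithmetic above can be retraced with a few lines of Python. The sketch below is a hedged illustration only: the variable names are hypothetical, the fluxes are those quoted in the text from Fig. 11.1, and the two heat balances are written in an assumed form (atmosphere: absorbed solar + thermo-evaporation + absorbed long wave = total long wave emission; surface: absorbed solar + back radiation = thermo-evaporation + long wave emission with surface emissivity of 1). Adding these two balances eliminates the thermo-evaporation term and leaves the surface temperature as the only unknown.

```python
SIGMA = 5.67e-8     # Stefan-Boltzmann constant, W/m^2 K^4
S = 342.0           # average incoming solar flux, W/m^2 (Eq. 11.1)

# Fluxes quoted in the text (W/m^2)
q_abs_atm_sw = 67.0             # solar flux absorbed by the atmosphere
q_reflected = 77.0              # solar flux reflected by the atmosphere
q_abs_sur_sw = 168.0            # solar flux absorbed by the surface
q_lw_abs_atm = 350.0            # surface long wave flux absorbed by atmosphere
q_lw_surface = 390.0            # long wave flux emitted by the surface
q_up = 165.0 + 30.0             # atmospheric long wave emission to deep space
q_down = 324.0                  # atmospheric back radiation to the surface

# Radiative properties deduced directly from the fluxes
alpha_atm_sw = q_abs_atm_sw / S
tau_atm_sw = (S - q_reflected - q_abs_atm_sw) / S
alpha_sur_sw = q_abs_sur_sw / (S * tau_atm_sw)
alpha_atm_lw = q_lw_abs_atm / q_lw_surface

# Eq. 11.5 with eps_up + eps_down = 1: total emission = 2 * sigma * T_atm^4
T_atm = ((q_up + q_down) / (2.0 * SIGMA)) ** 0.25
eps_up = q_up / (q_up + q_down)
eps_down = q_down / (q_up + q_down)

# Sum of the two assumed heat balances (thermo-evaporation term cancels):
# S*alpha_atm_sw + S*tau*alpha_sur_sw + q_down - (q_up + q_down)
#     = (1 - alpha_atm_lw) * sigma * T_sur^4
net = S * alpha_atm_sw + S * tau_atm_sw * alpha_sur_sw + q_down - (q_up + q_down)
T_sur = (net / ((1.0 - alpha_atm_lw) * SIGMA)) ** 0.25

# Effective thermo-evaporation coefficient from the surface balance
h_ce = (S * tau_atm_sw * alpha_sur_sw + q_down - SIGMA * T_sur**4) / (T_sur - T_atm)

print(f"alpha_atm_sw={alpha_atm_sw:.3f}, tau_atm_sw={tau_atm_sw:.3f}, "
      f"alpha_sur_sw={alpha_sur_sw:.3f}, alpha_atm_lw={alpha_atm_lw:.3f}")
print(f"eps_up={eps_up:.3f}, eps_down={eps_down:.3f}")
print(f"T_atm={T_atm:.1f} K, T_sur={T_sur:.1f} K, h_ce={h_ce:.2f} W/m^2 K")
```

Under these assumptions the script gives temperatures of roughly 260 K for the atmosphere and 288 K for the surface, in line with the values quoted in the text. Because the model has zero degrees of freedom, the agreement is a consistency check on the flux data rather than an independent validation.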

The above example illustrates the use of a white box model whose
parameters have been calibrated or tuned with the data (observed or
estimated heat fluxes). Having such a model with parameters that can
be interpreted as physical quantities allows one to evaluate the
impact of different mitigation measures. One geo-engineering option
is to seed the atmosphere with reflective strips so as to increase
reflectivity to incoming solar radiation, and thereby reduce
$\tau_{atm,sw}$. Another avenue would be to evaluate the extent to
which reducing carbon dioxide levels in the atmosphere would lower
surface temperatures of the Earth. This measure would impact the
absorptivity, and thereby the transmittivity $\tau_{atm,lw}$, of the
atmosphere to long wave re-radiation from the surface. Fig. 11.3a
and b are plots indicating how the earth's surface temperature and
the atmospheric mean temperature would vary under these scenarios.
One notices that the trends are fairly linear, indicating that a 5%
reduction in $\tau_{atm,lw}$ would reduce the earth's surface
temperature by 4°C while a 10% reduction would result in a decrease
of 7°C.

Fig. 11.3 Sensitivity results of changing key parameters of the
simplistic global temperature model. a Effect of changing the
baseline absorptivity of the atmosphere to long wave radiation from
earth surface. b Effect of changing the baseline transmittivity of
the atmosphere to solar radiation

How good or realistic are these predictions? Should one include
terms in addition to those shown in Fig. 11.2 which have been
neglected in the above analysis? Is our approach of assuming the
entire atmosphere to be one lumped node and making somewhat
empirical adjustments to account for differences in long wave
radiation in the upward and downward directions realistic? How valid
is our approximation of assuming a mean surface temperature of the
earth? The oceans and the landmass behave very differently; should
these not be treated separately? Should latitude dependency and
altitude effects be brought into the model; and if so, how? Detailed
models have indicated that even if the man-made carbon dioxide
emissions are cut to zero immediately, the effects of global warming
will persist for decades to come; this suggests that dynamic effects
are important and require the use of differential equations. How
should feedback loops between the ocean, earth and the atmosphere be
included into the model? The atmosphere has at least four distinct
layers with different temperature profiles; how should these be
treated? All these issues related to system identification are some
of the many modeling considerations one ought to evaluate. One could
also question the manner in which the various parameters were
estimated (as shown in Table 11.1). Are there other ways of
estimating them which will lead to better results? Finally, consider
the data itself, i.e., the fluxes shown in Fig. 11.1. How accurate
are these values? How have they been determined? Can additional
spatial and temporal measurements (of the ocean and the earth
surface, for example) be made which can be used to improve our
model? Such questions require that one try to acquire more data
(either by direct observations or from detailed computer
simulations) and gradually increase the complexity of the model so
as to make it more realistic.

This example has been presented more to illustrate the 
process of framing and tuning white box models to avail- 
able data. However, the overarching nagging concern in such 
cases is whether the model developed is realistic and com- 
plete enough that it captures the inherent complexity of the 
problem at hand. In such cases, a powerful argument can be 
made that calibrating detailed and complex models meant for 
forward simulation, even if the process has its own limita- 
tions, is the only rational approach for very complex prob- 
lems. This approach is addressed in the next two sections. 



11.2.3 Analysis Techniques Useful for Calibrating Detailed Simulation Models

The previous section dealt with developing and calibrating rather simple mechanistic
models which can be algebraic or differential equations. This section pertains to coupled,
complex sets of models with a large number of parameters, where a precise relationship
between outputs and inputs cannot be expressed analytically because of the complex nature
of the coupling. Solving such sets of equations requires computer programs with relatively
sophisticated solution routines. Calibration and validation of such models have been
addressed in several books and journal papers in diverse areas of engineering and science
such as environmental engineering, hydrology, epidemiology and structural engineering. The
crux of the problem is that the highly over-parameterized situation leads to a major
difficulty aptly stated by Hornberger and Spear (1981) as: "...most simulation models will
be complex, with many parameters, state-variables and non-linear relations. Under the best
circumstances, such models have many degrees of freedom and, with judicious fiddling, can
be made to produce virtually any desired behavior, often with both plausible structure and
parameter values." This process is also referred to in the scientific community as GIGOing
(garbage in-garbage out) where a false sense of confidence can result since precise outputs
are obtained by arbitrarily restricting the input space (Saltelli 2002). Hence, given the
limited monitored data available, one can at best identify only some of the numerous input
parameters of the set of models. Thus, model reduction is a primary concern and several
relevant analysis techniques are discussed below.

(a) Sensitivity Analysis The aim of sensitivity analysis is to determine or identify
which parameters have a significant effect on the simulated output parameters, and then to
quantify their relative importance. There are two types of sensitivities: (i) individual
sensitivities which describe the influence of a single parameter on system response, and
(ii) total sensitivities due to variation in all parameters together. The general approach
to determining individual sensitivity coefficients is summarized below:
(i) Formulate a base case reference and its description,
(ii) Study and break down the factors into basic parameters (parameterization),
(iii) Identify parameters of interest and determine their base case values,
(iv) Determine which simulation outputs are to be investigated and their practical implications,
(v) Introduce perturbations to the selected parameters about their base case values one at a time,
(vi) Study the corresponding effects of the perturbation on the simulation outputs,
(vii) Determine the sensitivity coefficients for each selected parameter.

Sensitivity coefficients (also called elasticity in economics, as well as influence
coefficients) are defined in various ways as shown in Table 11.2. The first group is
identical to the partial derivative of the output variable (OP) with respect to the input
parameter (IP). The second group uses the base case values to express the sensitivity in
percentage change, while the third group uses the mean values to express the percentage
change (this is similar to forward differencing and central differencing approaches used in
numerical methods). Form (1) is used in comparative studies because the coefficient thus
calculated can be used directly for error assessment. Forms (2a), (3a) and (3b) have the
advantage that the sensitivity coefficients are dimensionless. However, form (3a) can only
be applied to one-step change and cannot be used for multiple sets of parameters.

Table 11.2 Different forms of sensitivity coefficient. (From Lam and Hui 1996)

Form | Formula (see note 1) | Dimension | Common name(s)
1 | $\Delta OP / \Delta IP$ | With dimension | Sensitivity coefficient, influence coefficient
2a | $(\Delta OP / OP_{BC}) / (\Delta IP / IP_{BC})$ | % OP change / % IP change | Influence coefficient, point elasticity
2b | $(\Delta OP / OP_{BC}) / \Delta IP$ | With dimension | Influence coefficient
3a | $\left(\Delta OP / \frac{OP_1 + OP_2}{2}\right) / \left(\Delta IP / \frac{IP_1 + IP_2}{2}\right)$ | % OP change / % IP change | Arc mid-point elasticity
3b | $(\Delta OP / \Delta IP) / (\overline{OP} / \overline{IP})$ | % OP change / % IP change | (See note 2)

Notes:
1. $\Delta OP$, $\Delta IP$ = changes in output and input respectively
   $OP_{BC}$, $IP_{BC}$ = base case values of output and input respectively
   $IP_1$, $IP_2$ = two values of input
   $OP_1$, $OP_2$ = two values of the corresponding output
   $\overline{OP}$, $\overline{IP}$ = mean values of output and input respectively
2. For the form (3b), the slope of the linear regression line divided by the ratio of the
   mean output and mean input values is taken for determining the sensitivity coefficient
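The forms listed in Table 11.2 can be illustrated with a short numerical sketch. The Python snippet below is only a minimal illustration, not taken from Lam and Hui (1996): the function names, the quadratic `toy_model` and the perturbation from 10 to 11 are all hypothetical choices made for the example, and the first point is assumed to be the base case.

```python
def sensitivity_coefficients(ip1, op1, ip2, op2):
    """Forms 1, 2a, 2b and 3a of Table 11.2 for one perturbation.

    Assumption: (ip1, op1) is treated as the base case (IP_BC, OP_BC).
    """
    d_ip = ip2 - ip1          # change in input
    d_op = op2 - op1          # change in output
    ip_mid = (ip1 + ip2) / 2  # arc mid-point of the input
    op_mid = (op1 + op2) / 2  # arc mid-point of the output
    return {
        "1": d_op / d_ip,                        # with dimension
        "2a": (d_op / op1) / (d_ip / ip1),       # point elasticity
        "2b": (d_op / op1) / d_ip,               # with dimension
        "3a": (d_op / op_mid) / (d_ip / ip_mid)  # arc mid-point elasticity
    }


def form_3b(ips, ops):
    """Form 3b: regression slope divided by (mean output / mean input)."""
    n = len(ips)
    ip_mean = sum(ips) / n
    op_mean = sum(ops) / n
    sxy = sum((x - ip_mean) * (y - op_mean) for x, y in zip(ips, ops))
    sxx = sum((x - ip_mean) ** 2 for x in ips)
    return (sxy / sxx) / (op_mean / ip_mean)


def toy_model(ip):
    # Hypothetical stand-in for a simulation output, quadratic in the input
    return 2.0 * ip ** 2


coeffs = sensitivity_coefficients(10.0, toy_model(10.0), 11.0, toy_model(11.0))
elasticity_3b = form_3b([1.0, 2.0, 3.0], [3.0, 6.0, 9.0])
for form, value in coeffs.items():
    print(f"form {form}: {value:.4f}")
print(f"form 3b (linear data): {elasticity_3b:.4f}")
```

For this toy model the dimensionless forms (2a) and (3a) both come out close to 2, which is the exponent of the assumed power law, while form (1) depends on the units chosen for the input and output.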

There is a wide range of such analysis methods, as presented in a book by Saltelli et al.
(2000) and in numerous technical papers. The analyst is often faced with the difficult task
of selecting the one method most appropriate for the application at hand. A report by Iman
and Helton (1985) compares different sensitivity analysis methods as applied to complex
engineering systems, and summarizes current knowledge in this area. Of all the techniques,
three have been found to be promising for multi-response sensitivities: (i) Response
surface¹ replacement of the computer model where fractional factorial design is used to
generate the response surface (see Sect. 6.4). This method is optimal if the models are
linear; (ii) Differential analysis, which is intended to provide information with respect to
small perturbations about a point. However, this approach is not suited for complex models
with large uncertainties; and (iii) Latin hypercube Monte Carlo sampling, which offers
several advantages and is described below. When the number of input parameters is large
along with large uncertainty in the input parameters, and when the input parameters are
interdependent and non-linear, Monte Carlo methods (though computationally more demanding)
are simpler to implement and require a much lower level of mathematics while providing
adequate robustness.

¹ Experimental design methods, such as $2^k$ factorial designs (see Sect. 6.3.1), also
provide a measure of sensitivities and have been in existence for several decades. However,
these methods have not been identified as promising since they only provide one-way
sensitivity (i.e., the effect on the system response when only one parameter is varied at a
time) rather than the multi-response sensitivity sought.

(b) Monte Carlo (MC) Methods The Monte Carlo (MC) approach, of which there are several
types, comprises that branch of experimental mathematics which relies on experiments using
random numbers to infer the response of a system (Hammersley and Handscomb 1964). MC is a
general method of analysis where chance events are artificially recreated numerically (on a
computer), the simulation is run numerous times, and the results provide the necessary
insights. MC methods provide approximate solutions to a variety of deterministic and
stochastic problems, hence their widespread appeal. The application of such methods for
uncertainty propagation calculations was described in Sect. 3.7.3, and for low aleatory but
high epistemic uncertainty problems in Sect. 12.2.7 in the framework of risk analysis and
decision making involving stochastic system simulation. The many advantages of MC methods
are: low level of mathematics, applicability to a large number of different types of
problems, ability to account for correlations between inputs, and suitability to situations
where model parameters have unknown distributions. They allow synthetic data to be generated
from observations which have been corrupted with noise, usually additive or multiplicative,
of pre-selected magnitude (with or without bias) and specified probability distribution.
This allows one to generate different sets of data sequences of pre-selected sample size,
from which sampling distributions of the parameter estimates can be deduced and their
sensitivity to the various assumptions evaluated.

MC methods are numerical methods in that all the uncertain inputs must be assigned a
definite probability distribution. For each simulation, one value is selected at random for
each input based on its probability of occurrence. Numerous such input sequences are
generated and simulations performed². Provided the number of runs is large, the estimate of
the mean simulation output will be approximately normally distributed irrespective of the
probability distributions of the inputs. Though any non-linearity between the inputs and
output is accounted for, the accuracy of the results depends on the number of runs. However,
given the power of modern computers, the relatively large computational effort is not a
major limitation except in very large simulation studies. The concept of "efficiency" has
been used to compare different schemes of implementing MC methods. Say two methods, schemes
1 and 2, are to be compared. Method 1 calls for $n_1$ units of computing time (i.e., number
of times that the simulation is performed), while method 2 calls for $n_2$ units. Also, let
the resulting estimates of the response variable have variances $\sigma_1^2$ and
$\sigma_2^2$. Then, the efficiency of method 2 with respect to method 1 is determined as:

$$\varepsilon_2 = \frac{n_1 \cdot \sigma_1^2}{n_2 \cdot \sigma_2^2} \qquad (11.6)$$

where $(n_1/n_2)$ is called the labor ratio, and $(\sigma_1^2/\sigma_2^2)$ is called the
variance ratio.

² There is a possibility of confusion between the bootstrap and Monte Carlo simulation
approaches. The tie between them is obvious: both are based on repetitive sampling and then
direct examination of the results. A key difference between the methods, however, is that
bootstrapping uses the original or initial sample as the population from which to resample,
whereas Monte Carlo simulation is based on setting up a sample data generation process for
the inputs of the simulation or computational model.
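A small numerical experiment may help make Eq. 11.6 concrete. The sketch below is an illustration under stated assumptions rather than a procedure from the cited references: the response function `response`, the sample sizes, the number of repetitions and the random seed are hypothetical, and the variances are estimated empirically by repeating each scheme many times. Scheme 1 is crude MC and scheme 2 is a simple stratified scheme with one draw per equal-width stratum of [0, 1).

```python
import random
import statistics


def efficiency(n1, var1, n2, var2):
    """Eq. 11.6: efficiency of method 2 with respect to method 1."""
    return (n1 * var1) / (n2 * var2)


def response(x):
    # Hypothetical response variable; its mean over [0, 1) is 1/3
    return x ** 2


def crude_mc(f, n, rng):
    """Scheme 1: mean of f over n independent uniform draws."""
    return sum(f(rng.random()) for _ in range(n)) / n


def stratified_mc(f, n, rng):
    """Scheme 2: one uniform draw inside each of n equal-width strata."""
    return sum(f((i + rng.random()) / n) for i in range(n)) / n


rng = random.Random(0)   # fixed seed so the experiment is repeatable
n1 = n2 = 50             # simulations per estimate for each scheme
repetitions = 200        # repeated estimates used to measure variance

estimates_1 = [crude_mc(response, n1, rng) for _ in range(repetitions)]
estimates_2 = [stratified_mc(response, n2, rng) for _ in range(repetitions)]

var_1 = statistics.variance(estimates_1)
var_2 = statistics.variance(estimates_2)
eff_2 = efficiency(n1, var_1, n2, var_2)
print(f"variance ratio = {var_1 / var_2:.1f}, efficiency = {eff_2:.1f}")
```

Because the labor ratio is 1 here, the efficiency equals the variance ratio; a value well above 1 indicates that the stratified scheme needs far fewer runs than crude MC for the same precision on this smooth toy response.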

MC methods have emerged as a basic and widely used tool to quantify uncertainties
associated with model predictions, and also for examining the relative importance of model
parameters in affecting model performance (Spears et al. 1994). There are different types of
MC methods depending on the sampling algorithm used to generate the trials (Helton and Davis
2003):

(i) Hit and miss methods, which were the historic manner of explaining MC methods. They
involve using random sampling for estimating integrals (i.e., for computing areas under a
curve and solving differential equations);
(ii) Crude MC, using traditional random sampling where each sample element is generated
independently following a pre-specified distribution;
(iii) Stratified MC (also called importance sampling), where the population is divided into
groups or strata according to some pre-specified criterion, and sampling is done so that
each stratum is guaranteed representation (unlike the crude MC method). This method is said
to be an order of magnitude more efficient than the crude MC method;
(iv) Latin hypercube MC (LHMC), which uses stratified sampling without replacement, and is
easiest to implement especially when the number of variables is large. It can be viewed as a
compromise procedure combining many of the desirable features of random and stratified
sampling. It produces more stable results than random sampling and does so more efficiently.
It is easier to implement than stratified sampling for high dimension problems since it is
not necessary to determine strata and strata probabilities. Because of its efficient
stratification process, LHMC is primarily intended for long-running models.
LHMC is said to be one of the most promising methods for performing sensitivity studies in
long-running complex models (Hofer 1999). LHMC sampling is conceptually easy to grasp. Say
a sample of size n is to be generated from the parameter vector
$p = [p_1, p_2, p_3, \ldots, p_k]$. The range of each parameter $p_j$ is divided into n
disjoint intervals of equal probability and one value is selected randomly from each
interval. The n values thus obtained for $p_1$ are paired at random without replacement with
similarly obtained n values for $p_2$. These n pairs are then combined in a random manner
without replacement with the n values of $p_3$ to form n triples. This process is continued
until a set of n k-tuples is formed. This constitutes one LHMC sample. Modifications of
this method to deal with correlated variables have also been proposed. The paper by Helton
and Davis (2003) cites over 150 references in the area of sensitivity analysis, discusses
the clear advantages of LHMC for analysis of complex systems, and enumerates the reasons
for the popularity of such methods.
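The LHMC construction just described can be sketched in a few lines of Python. This is a minimal illustration assuming independent parameters sampled on the probability scale [0, 1); the function names and the example range for the third parameter are hypothetical, and mapping to a non-uniform parameter would require applying that parameter's inverse cumulative distribution function instead of the linear scaling shown.

```python
import random


def lhmc_sample(n, k, rng):
    """Return one LHMC sample: n k-tuples on the probability scale [0, 1).

    For each of the k parameters, the unit interval is split into n
    equal-probability intervals, one value is drawn in each interval, and
    the n values are shuffled so that pairing across parameters is random
    and without replacement.
    """
    columns = []
    for _ in range(k):
        values = [(i + rng.random()) / n for i in range(n)]
        rng.shuffle(values)
        columns.append(values)
    # Row i collects the i-th value of every parameter into one k-tuple
    return [tuple(column[i] for column in columns) for i in range(n)]


def scale_uniform(u, low, high):
    """Map a probability-scale value to a uniformly distributed parameter."""
    return low + u * (high - low)


rng = random.Random(42)  # fixed seed for repeatability
sample = lhmc_sample(n=6, k=3, rng=rng)

# Hypothetical example: third parameter assumed uniform between 0.1 and 0.5
third_parameter = [scale_uniform(row[2], 0.1, 0.5) for row in sample]

for row in sample:
    print(tuple(round(u, 3) for u in row))
```

Each column of the printed sample contains exactly one value in each of the six intervals, which is the stratification property that distinguishes LHMC from crude random sampling.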

(c) Regional Sensitivity Analysis Once the LHMC runs have been performed, one needs to
identify the strong or influential parameters and/or the weak ones appearing in the set of
modeling equations; this is achieved by a process called regional sensitivity analysis. If
the "weak" parameters can be fixed at their nominal values and removed from further
consideration in the calibration process, the parameter space would be reduced enormously,
which would somewhat alleviate the "curse of dimensionality". This model reduction is
achieved in two steps: (i) filtering, so as to reject runs that fail to meet some prescribed
criteria of model performance or goodness-of-fit of simulation outputs with measured system
performance, and (ii) ascertaining from the remaining runs which sets of parameters appear
more frequently; these can then be assumed to be the influential parameters.

Since running detailed simulation programs is computationally intensive and involves long
run-times, one cannot afford to perform separate simulation runs for identifying promising
parameter vector combinations and for sensitivity analysis. Therefore, the following
procedure can be adopted in order to satisfy both these objectives simultaneously. Assume
that 30 "candidate" parameter vectors were identified with each parameter discretized into
three states for performing the LHMC runs. One would expect that if an individual parameter
were "weak", its states would be randomly distributed among these 30 "candidate" vectors.
Thus, the extent to which the number of occurrences of an individual parameter differs from
10 within each discrete state would indicate whether this parameter is strong or weak. This
is a type of sensitivity test where the weak and strong parameters are identified using
non-random pattern tests (Saltelli et al. 2000). The well-known chi-square ($\chi^2$) test
for comparing distributions (see Sect. 2.4.3g) is advocated to assess statistical
independence for each and every parameter. First, the $\chi^2$ statistic is computed as:



$$\chi^2 = \sum_{s=1}^{3} \frac{(p_{obs,s} - p_{exp})^2}{p_{exp}} \qquad (11.7)$$

where $p_{obs,s}$ is the observed number of occurrences in state s, $p_{exp}$ is the
expected number (in the example above, this will be 10), and the subscript s refers to the
index of the state (in this case, there are three states). If the observed number is close
to the expected number, the $\chi^2$ value will be small, indicating that the observed
distribution fits the theoretical distribution closely. This would imply that the particular
parameter is weak since the corresponding distribution can be viewed as being random. Note
that this test requires that the degrees of freedom (d.f.) be selected as (number of states
- 1), i.e., in our case d.f. = 2. The critical values for the $\chi^2$ distribution for
different significance levels $\alpha$ are given in Table 11.3. If the $\chi^2$ statistic
for a particular parameter is greater than 9.21, one could assume it to be very strong since
the associated statistical probability is greater than 99%. On the other hand, a parameter
having a value below 1.386 ($\alpha$ = 0.5) could be considered to be weak, and those in
between the two values as uncertain in influence.

Table 11.3 Critical thresholds for the chi-square statistic with different significance levels for degrees of freedom 2

d.f. | $\alpha$ = 0.001 | $\alpha$ = 0.005 | $\alpha$ = 0.01 | $\alpha$ = 0.05 | $\alpha$ = 0.2 | $\alpha$ = 0.3 | $\alpha$ = 0.5 | $\alpha$ = 0.9
2 | 13.815 | 10.597 | 9.210 | 5.991 | 3.219 | 2.408 | 1.386 | 0.211
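The chi-square screening of Eq. 11.7 can be sketched as follows. The two thresholds (9.210 and 1.386) are the d.f. = 2 critical values quoted in the text, while the parameter names and the occurrence counts are purely hypothetical numbers invented for the illustration: each list gives how many of the 30 candidate vectors fall in each of the three states of a parameter.

```python
STRONG_THRESHOLD = 9.210  # chi-square critical value, d.f. = 2, alpha = 0.01
WEAK_THRESHOLD = 1.386    # chi-square critical value, d.f. = 2, alpha = 0.5


def chi_square_statistic(observed_counts, expected_count):
    """Eq. 11.7 summed over the discrete states of one parameter."""
    return sum((obs - expected_count) ** 2 / expected_count
               for obs in observed_counts)


def classify_parameter(chi2):
    """Label a parameter as strong, weak or uncertain in influence."""
    if chi2 > STRONG_THRESHOLD:
        return "strong"
    if chi2 < WEAK_THRESHOLD:
        return "weak"
    return "uncertain"


# Hypothetical occurrence counts in the 30 candidate vectors (3 states each)
occurrences = {
    "param_A": [22, 6, 2],
    "param_B": [11, 9, 10],
    "param_C": [15, 9, 6],
}

expected = 30 / 3  # 10 occurrences per state if the parameter were weak
results = {}
for name, counts in occurrences.items():
    chi2 = chi_square_statistic(counts, expected)
    results[name] = (chi2, classify_parameter(chi2))
    print(f"{name}: chi2 = {chi2:.2f} -> {results[name][1]}")
```

With these invented counts, the first parameter is heavily concentrated in one state and is flagged as strong, the second is nearly uniform and is flagged as weak, and the third falls between the two thresholds.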



11.2.4 Case Study: Calibrating Detailed Building Energy Simulation Programs to Utility Bills

(a) Background The oil shock of 1973 led to the widespread initiation of Demand Side
Management (DSM) projects especially targeted at the residential and small commercial
building stock. Subsequently, in the 1980s, building professionals started becoming aware of
the potential and magnitude of energy conservation savings in large buildings (office,
commercial, hospitals, retail...). DSM measures implemented included any retrofit or
operational practice, usually some sort of passive load curtailment measure during the peak
hours, such as installing thermal storage systems, retrofits to save energy (such as
delamping, energy efficient lamping, changing constant air volume HVAC systems into variable
air volume), demand meters in certain equipment (such as chillers), and energy management
and control systems (EMCS) for lighting load management.

During the last decade, electric market transformation and utility de-regulation have led
to new thinking towards more pro-active load management of single and multiple buildings.
The proper implementation of DSM measures involved first the identification of the
appropriate energy conservation measures (ECMs), and then assessing their impact or
performance once implemented. This need resulted in monitoring and verification (M&V)
activities acquiring key importance. Typically, retrofits involved rather simple energy
conservation measures in numerous more or less identical buildings. The economics of these
retrofits dictated that associated M&V also be low-cost. This led to utility bill analysis
(involving no extra metering cost), and even to analyzing only a representative sub-set of
the entire number of retrofitted residences or small commercial buildings. In contrast,
large commercial buildings have much higher utility costs, and the HVAC&R devices are not
only more complex but more numerous as well. Hence, the retrofits were not only more
extensive but the large cost associated with them justified a relatively large budget for
M&V as well. The analysis tools used for DSM projects were found to be too imprecise and
inadequate, which led to subsequent interest in the development of more specialized inverse
modeling and analysis methods which use monitored energy data from the building along with
other variables such as climatic variables and operating schedules. Most of the topics
presented in this book have direct relevance to such specialized modeling and analysis
methods.

A widely used technique is the calibrated simulation approach whereby the input parameters
specifying the building and equipment necessary to run the detailed building energy
simulation program are tuned so as to match measured data. Such a model/program would
potentially allow more reliable and accurate predictions than regression models or
statistical approaches. Initial attempts, dating back to the early 1980s, involved using
utility bills with which to perform the calibration. A large number of energy professionals
are involved in performing calibrated simulations, and many more profess an active interest
in this area. Further, the sharp increase in activity by Energy Service Companies (ESCOs)
led to numerous papers being published in this area (reviewed by Reddy 2006).

Calibrated simulation can be used for the following purposes:

(i) To prove/improve specific models used in a larger simulation program (Clarke 1993);
(ii) To provide insight to an owner into the building's thermal and/or electrical diurnal
load shapes using utility bill data (Sonderegger et al. 2001);
(iii) To provide an electric utility with a breakdown of baseline, cooling, and heating
energy use for one or several buildings based on their utility bills in order to predict the
impact of different load control measures on the aggregated electrical load (Mayer et al.
2003);
(iv) To support investment-grade recommendations made by an energy auditor tasked to
identify cost effective ECMs (equipment change, schedule change, control settings...)
specific to the individual building and determine their payback;
(v) For M&V under one or several of the following circumstances (ASHRAE 14 2002): (i) to
identify a proper contractual baseline energy use against which to measure energy savings
due to ECM implementation; (ii) to allow making corrections to the contractual baseline
under unanticipated changes (creep in plug load, changes in operating hours, changes in
occupancy or conditioned area, addition of new equipment...); (iii) when the M&V requires
that the effect of an end-use retrofit be verified using only whole building monitored data;
(iv) when retrofits are complex and interactive (e.g., lighting and chiller retrofits) and
the effect of individual retrofits needs to be isolated without having to monitor each
sub-system individually; (v) when either pre-retrofit or post-retrofit data are inadequate
or not available at all (for example, for a new building or if the monitoring equipment is
installed after the ECM has been implemented); and (vi) when the length of post-retrofit
monitoring for verification of savings needs to be reduced;
(vi) To provide facility/building management services, owners and ESCOs with the capability
of implementing: (i) continuous commissioning or fault detection (FD) measures to identify
equipment malfunction and take appropriate action (such as tuning/optimizing HVAC and
primary equipment controls; Claridge and Liu 2001), (ii) optimal supervisory control,
equipment scheduling and operation of the building and its systems, either under normal
operation or under active load control in response to real-time price signals.

(b) Proposed Methodology The case study described in this section is a slightly simplified
version³ of a research study fully documented in Reddy et al. (2007a, b). The calibration
methodology proposed and evaluated is based on the concepts described in the previous
section, and can be summarized into the following phases (see Fig. 11.4):
(i) Gather relevant building, equipment and sub-system information along with performance
data in the form of either utility bills and/or hourly monitored data;
(ii) Identify a building energy program which has the ability to simulate the types of
building elements and systems present, and set up the simulation input file to be as
realistic as possible;
(iii) Reduce the dimensionality of the parameter space by resorting to walk-thru audits and
heuristics. For a given building type, identify/define a set of influential parameters and
building operating schedules along with their best-guess estimates (or preferred values) and
their range of variation, characterized either by the minimum and maximum values or by the
upper and lower 95th probability threshold values. The set of influential parameters to be
selected should be such that they correspond to specific and easy-to-identify inputs to the
simulation program;
(iv) Perform a "bounded" grid calibration (or unstructured or blind search) using LHMC
trials or realizations with different combinations of input parameter values. Preliminary
filtering, i.e., identification of a small set of the trials which meet pre-specified
goodness-of-fit criteria, along with regional sensitivity analysis then provides a means of
identifying the weak and the strong parameters as well as determining narrower bounds of
variability of these strong parameters;
(v) The conventional wisdom was that once a simulation model is calibrated with actual
utility bills, the effect of different intended ECMs can be predicted with some degree of
confidence by making changes to one or more of the model input parameters which characterize
the ECM. Such thinking is clearly erroneous since the utility billing data is the aggregate
of several end-uses within the building, each of which is affected by one or more specific
and interacting parameters. While performing calibration, the many degrees of freedom may
produce good calibration overall even though the individual parameters may be incorrectly
identified. Subsequently, altering one or more of these incorrectly identified parameters to
mimic the intended ECM is very likely to yield biased predictions. The approach proposed
here involves using several plausible predictions on which to make inferences, which
partially overcomes the danger of misleading predictions of "so-called" calibrated models.
Thus, rather than using only the "best" calibrated solution of the input parameter set
(determined on how well it fits the data) to make predictions about the effect of intended
ECMs, a small number of the top plausible solutions are identified instead with which to
make predictions. Not only is one likely to obtain a more robust prediction of the energy
and demand reductions, but this would allow determining their associated prediction
uncertainty as well.

³ The original study suggested an additional phase involving refining the estimates of the
strong parameters after the bounded grid search was completed. This could be done by one of
several methods such as analytical optimization or genetic algorithms. This step has been
intentionally left out in order not to overly burden the reader.
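The idea in phase (v) of predicting with several plausible solutions rather than the single best one can be sketched as below. This is only an illustration under stated assumptions: the goodness-of-fit index `gof` (lower taken to be better), the number of retained solutions and all numerical values are hypothetical placeholders, not results or criteria from Reddy et al. (2007a, b). In practice each `savings` entry would come from re-running the simulation program with the ECM applied to that calibrated parameter vector.

```python
import statistics


def top_plausible_solutions(trials, m):
    """Keep the m trials with the best (lowest) goodness-of-fit index."""
    return sorted(trials, key=lambda trial: trial["gof"])[:m]


def summarize_savings(solutions):
    """Median prediction and its spread across the plausible solutions."""
    savings = [trial["savings"] for trial in solutions]
    return {
        "median": statistics.median(savings),
        "min": min(savings),
        "max": max(savings),
    }


# Hypothetical LHMC trials: fit index (%) and predicted ECM savings (%)
trials = [
    {"id": 1, "gof": 4.1, "savings": 11.8},
    {"id": 2, "gof": 9.7, "savings": 7.5},
    {"id": 3, "gof": 3.2, "savings": 12.6},
    {"id": 4, "gof": 5.0, "savings": 10.9},
    {"id": 5, "gof": 12.4, "savings": 15.2},
    {"id": 6, "gof": 3.9, "savings": 13.1},
]

plausible = top_plausible_solutions(trials, m=4)
summary = summarize_savings(plausible)
print("retained trials:", [trial["id"] for trial in plausible])
print("predicted savings (%):", summary)
```

Reporting the median together with the range across the retained solutions gives both a more robust estimate and a rough measure of prediction uncertainty, in the spirit of the approach described above.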

(c) Description of Buildings The calibration methodology was applied to two synthetic
office buildings in several locations and to one actual office building. The DOE-2 detailed
public domain hourly building energy simulation program (Winkelmann et al. 1993) was used.
The calibration was to be done for the prevalent case when only monthly utility billing data
for a whole year were available. Moreover, it was presumed that the level and accuracy of
knowledge about the building geometry, scheduling and various system equipment would be
consistent with a "detailed investment grade" audit, involving equipment nameplate
information as well as some limited onsite measurements (clamp-on meters...) performed
during different times of the day (morning, afternoon, night) as well as over different days
of the week in order to better define operating schedules in some of the simulation inputs.

Fig. 11.4 Flowchart of the methodology for calibrating detailed building energy simulation programs. (Modified from Reddy et al. 2007a)

Evaluation using the synthetic buildings involved select- 
ing a building and specifying its various construction and 
equipment parameters as well as its operating schedules 
(called reference values), and using the DOE-2 simulation 
program to generate "electric utility bill" data for a whole 
year coinciding with calendar months. The utility billing 
data are then assumed to be the measured data against which 
calibration is performed. Since the "correct" or reference 
parameters are known beforehand, one can evaluate the accuracy and robustness of the
proposed calibration methodology by determining how correctly the calibrated models can fit
the utility bill data, and also how accurately the effect of various ECMs can be predicted.
The results of only one synthetic building are presented and discussed below while the
interested reader can refer to the source documents for complete results.

The synthetic office building (S2) is similar to an actual building in terms of
construction and mechanical equipment and is located in Atlanta, GA. It is a class A large
building (20,289 m²) with 7 floors and a penthouse, with the lobby, cafeteria, service areas
and mechanical/electrical rooms on the first floor, and offices on the remaining floors.
Building cooling is provided by electricity while heating is met by natural gas (in units of
Therms). Table 11.4 assembles a list of heuristically-identified influential parameters
which have simple and clear correspondence to specific and easy to identify inputs to the
DOE-2 simulation program. The complex building category includes 24 parameters consisting of
7 discrete schedules, 13 continuous parameters, and 4 binary parameters (i.e., either on or
off or one of only two possible categories). Table 11.4 also contains information about the
range of the 24 parameters (i.e., the minimum and the maximum values which the parameter can
assume) as well as the base or preferred value, i.e., the values of the various parameters
which the analyst deems most likely. These may be (and are) different from the reference
case values which were used to generate the synthetic data. It should be pointed out that
the discrete parameters P1-P7 are diurnal schedules which consist of a set of 24 hourly
values (which are fully described in the source documents).

Table 11.4 List of influential parameters for the complex office building category S2. The minimum, base (or best guess), and maximum values characterize the assumed range while the reference values are those used to generate the synthetic utility bills used for calibration

No | Category | Description | Variable type | Unit | Minimum value | Base value | Maximum value | Reference case value
1 | Load schedules (rooms) | Lighting schedule | D | NA | OffLgt_1 | OffLgt_D | OffLgt_2 | OffLgt_D
2 | | Equipment schedule | D | NA | OffEqpt_1 | OffEqpt_D | OffEqpt_2 | OffEqpt_2
3 | | Auxiliary equipment schedule | D | NA | AuxOffEqpt_1 | AuxEqpt_D | AuxOffEqpt_2 | AuxEqpt_1
4 | System schedules (zones) | Fans schedule | D | NA | OffFan_1 | OffFan_D | OffFan_2 | OffFan_2
5 | | Space heating temperature schedule | D | NA | OffHtT_1 | OffHtT_D | OffHtT_2 | OffHtT_1
6 | | Space cooling temperature schedule | D | NA | OffClT_1 | OffClT_D | OffClT_2 | OffClT_D
7 | | Outside air (ventilation) schedule | D | NA | OffOA_1 | OffOA_D | OffOA_2 | OffOA_2
8 | Envelope loads | Window shading coefficient | C | Fraction | 0.16 | 0.75 | 0.93 | 0.8
9 | | Window U value | C | Btu/(h·°F·sqft) | 0.25 | 0.57 | 1.22 | 0.65
10 | | Wall U value | C | Btu/(h·°F·sqft) | 0.0550 | 0.1000 | 0.5800 | 0.3000
11 | Internal loads (rooms) | Lighting power density | C | W/sqft | 1.3 | 1.7 | 2 | 1.8
12 | | Equipment power density | C | W/sqft | 0.8 | 1.0 | 1.2 | 0.9
13 | System variables | Supply fan power/delta_T | C | kW/CFM | 0.00124 | 0.00145 | 0.00166 | 0.00135
14 | | Energy efficiency ratio (EIR) | C | Fraction | 0.1564 | 0.1849 | 0.2275 | 0.20478
15 | | On hours control | D | NA | VFD | IGV | IGV | IGV
16 | | Off hours control | D | NA | OFF | Cycle on any zone | Cycle on any zone | Cycle on any zone
17 | | Minimum supply air flow | C | Fraction | 0.3 | 0.65 | 1.0 | 0.7
18 | | Economizer | D | NA | Yes | Yes | No | Yes
19 | | Minimum outside air | C | Fraction | 0.1 | 0.3 | 0.5 | 0.4
20 | Auxiliary electrical loads | Auxiliary electrical loads (non-HVAC effect) | C | kW | 25 | 50 | 75 | 65
21 | | Cooling tower fan power | C | BHP/GPM | 0.0118 | 0.0184 | 0.0212 | 0.0189
22 | | Cooling tower fan control | D | NA | VFD | Two speed | Single speed | Two speed
23 | | Primary CHW & cond. pump flow | C | GPM/Ton | 2.25 | 2.7 | 3.38 | 2.9
24 | | Boiler efficiency ratio | C | Fraction | 1.25 | 1.43 | 1.54 | 1.33

C continuous, D discrete
The schedules (parameters 1-7) involve a vector of 24 hourly values which can be found in Reddy et al. (2007a, b)

(d) Bounded Grid Search. This phase first involves a blind LHMC search with the range of each
parameter divided into three intervals of equal probability, followed by a regional
sensitivity analysis. Here, Monte Carlo (MC) filtering allows rejecting sets of model
simulations that fail to meet some prescribed criteria of model performance which combine
all three energy channels (electricity use and demand, and gas use). Such goodness-of-fit
(GOF) criteria are based on the normalized mean bias error (NMBE), or the coefficient of
variation (CV), or on a combined index which weights both of them (GOF_Total). Once a LHMC
batch run of several trials is completed, the GOF_CV and GOF_NMBE indices are computed for
each trial, from which those parameter vectors which result in high GOF numbers (i.e.,
those whose predictions fit the utility bills poorly) can be weeded out. The information
contained in the remaining "good" or promising parameter vectors can be used to identify
the weak parameters, which can then be removed from further consideration in the
calibration process.
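
The sampling and filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the study: the function names are hypothetical, each parameter is assumed to be uniformly distributed between its minimum and maximum, only one energy channel is filtered, and the NMBE and CV are written in their simplest form (no degrees-of-freedom correction). The simulation program itself is not called; the filter simply operates on whatever monthly values the simulations return.

```python
import numpy as np

def lhmc_three_interval(n_trials, lower, upper, rng):
    """Draw trials so that, for every parameter, each of the three
    equal-probability intervals is used (nearly) equally often.
    Assumes a uniform distribution between lower and upper (an assumption)."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    n_params = lower.size
    # Interval labels 0, 1, 2 repeated to length n_trials, then shuffled
    # independently for each parameter (the Latin hypercube idea).
    levels = np.resize(np.arange(3), n_trials)
    idx = np.column_stack([rng.permutation(levels) for _ in range(n_params)])
    # Random position inside the chosen interval, rescaled to [0, 1)
    u = (idx + rng.random((n_trials, n_params))) / 3.0
    return lower + u * (upper - lower), idx

def nmbe(measured, simulated):
    """Normalized mean bias error in percent (sign convention is an assumption)."""
    return 100.0 * np.sum(measured - simulated) / (measured.size * measured.mean())

def cv_rmse(measured, simulated):
    """Coefficient of variation of the root mean square error in percent."""
    return 100.0 * np.sqrt(np.mean((measured - simulated) ** 2)) / measured.mean()

def feasible_trials(measured, simulated_trials, limit=15.0):
    """Monte Carlo filtering: keep trials whose |NMBE| and CV are both below limit."""
    return np.array([
        abs(nmbe(measured, s)) < limit and cv_rmse(measured, s) < limit
        for s in simulated_trials
    ])
```

The interval labels returned alongside the samples are what a regional sensitivity analysis would tabulate for the retained trials, for instance to check whether feasible trials favor one interval of a given parameter.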

For the sake of brevity, only a representative selection of results is assembled in
Table 11.5 for building S2. Two variants have been analyzed: one-stage calibration (coded
"a") and two-stage calibration (coded "b"). The two-stage calibration simply refers to a
process where an additional set of LHMC runs is performed with the weak parameters
identified during the first set frozen at their best-guess values*. This variant was
investigated to see whether the calibration process is improved as a result. It was found
that though there was a small improvement (compare runs 2a and 2b in Table 11.5), the
improvement was not significant. The number of trials for each run, as well as the minimum,
maximum and median GOF_Total of the top 20 calibrated runs, are shown in the table. Equal
weighting factors for electric energy use (kWh), demand (kW) and gas use were assumed,
while a ratio of 3:1 was selected for NMBE:CV values when computing GOF_Total. Calibrated
solutions are considered feasible if both GOF_NMBE and GOF_CV < 15%, which is consistent
with the recommendation of ASHRAE Guideline 14 (2002). This translated into a GOF_Total of
11% when using these weights. One notes from Table 11.5 that only about 10% of the total
number of LHMC trials are deemed to contain feasible sets of input parameters.

Table 11.5 Summary of various sets of runs for building S2. The ranges of the 24 input parameters are given in Table 11.4. Equal
weighting factors for kWh/kW/Gas utility bills are assumed while a ratio of 3:1 has been selected for NMBE:CV values. Calibrated
solutions are considered feasible if both GOF_NMBE and GOF_CV < 15%

| Run | Number of LHMC trials | Number of input parameters | Number of feasible trials | GOF_Total of top 20 solutions: Min % | Max % | Median % |
|-----|-----------------------|----------------------------|---------------------------|--------------------------------------|-------|----------|
| 1a | 1,500 | 24 | 141 | 5.26 | 7.66 | 6.75 |
| 2a | 3,000 | 24 | 326 | 3.75 | 6.57 | 6.00 |
| 3a | 5,000 | 24 | 548 | 2.80 | 5.88 | 4.85 |
| 2b | 3,000/3,000 | 24/19 | 391 | 3.34 | 6.45 | 5.94 |

a single-stage calibration
b two-stage calibration, which involves freezing the weak parameters whose Chi-square value < 1.4

As expected, the number of LHMC trials is a major factor; increasing this number results
in better calibration. For good accuracy of fit, it would seem that about 3,000-5,000
calibration trials would be needed. Figure 11.5 indicates how the GOF indices (NMBE and CV)
of the three energy use quantities scatter about the origin for each of the 3,000 LHMC
trials. Only the parameter vectors corresponding to those runs which cluster around the
origin are selected for further analysis involving identifying strong and weak parameters.
How well the best run (i.e., the one with the lowest GOF_Total) of the 3,000 LHMC trials is
able to track the actual utility bill data is illustrated by the three time series plots of
Fig. 11.6. One notices that the fits are excellent (in fact, the top dozen or so runs are
very close, and those from a LHMC run with 5,000 trials are even better). An upper limit to
the calibration accuracy seems to be about 2% for the best trial, and around 4-7% for the
median of the top 20 calibrated solutions.

* There were a large number of parameters identified as strong parameters (about 8-12 out
of 20 input parameters). Further, it was not always clear as to which of the three
equal-probability intervals to select. Hence, for the two-stage calibration, it was more
practical to freeze the weak parameters rather than the strong parameters.

Fig. 11.5 Scatter plots depicting goodness-of-fit of the three energy use channels with 3,000 LHMC trials. Only those of the
numerous trials close to the origin are then used for regional sensitivity analysis. (a) Electricity use. (b) Electricity demand.
(c) Gas thermal energy

Fig. 11.6 Time series plots of the three energy use channels for the best calibrated trial corresponding to the one-stage run with
3,000 LHMC trials (Run 2a of Table 11.5). (a) Electricity use. (b) Electricity demand. (c) Gas thermal energy

(e) Uncertainty in Calibrated Model Predictions. The issue of specific relevance here
relates to the accuracy with which the calibrated solutions are likely to predict energy
and demand savings if specific energy conservation measures (ECMs) were to be implemented.
In other words, one would like to investigate the ECM savings and their associated
uncertainty as predicted by the calibrated solutions. As stated earlier, relying solely on
the predictions of any one calibration (even if it were to fit the utility bill data very
accurately) is inadvisable, since it provides no guarantee that individual parameters have
been accurately tuned. Instead, the calibration approach is likely to be more robust if a
small set of the most plausible solutions is identified. The effect of large deviations
from any individual outlier predictions can be greatly minimized (if not eliminated) by
positing that the actual value is likely to be bracketed by the inter-quartile standard
deviation around the median value. This would provide greater robustness by discounting
the effect of outlier predictions. The top 20 calibrated solutions were selected (somewhat
arbitrarily). The median and the inter-quartile standard deviation (i.e., the standard
deviation of the 10 trials whose predicted savings are between the 25th and 75th
percentiles) are then calculated from these 20 predictions.
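
The median and inter-quartile standard deviation described above can be computed as in the following minimal sketch. The function name is hypothetical, and two details are assumptions rather than statements from the text: the quartiles use NumPy's default linear interpolation, and the standard deviation is the sample value (divisor n-1).

```python
import numpy as np

def median_and_interquartile_std(predictions):
    """Return the median of all predictions, the standard deviation of only
    those lying between the 25th and 75th percentiles, and how many they are."""
    p = np.sort(np.asarray(predictions, dtype=float))
    q25, q75 = np.percentile(p, [25, 75])
    inner = p[(p >= q25) & (p <= q75)]       # predictions inside the inter-quartile range
    return float(np.median(p)), float(np.std(inner, ddof=1)), int(inner.size)
```

With 20 distinct predictions, the inner set contains the 10 central values, which matches the count quoted in the text; extreme predictions affect neither the median nor the spread.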

The predictive accuracy of the calibrated simulations has been investigated using four
different sets of ECM measures, some of which would result in increased energy use and
demand, such as ECM_C (Table 11.6). Retrofits for ECM_A involve modifying the minimum
supply air flow (parameter P17) and the minimum outdoor air fraction (P19), both of which
are strong parameters. The reference case corresponds to a value of P17=0.30 (reduced from
0.65) and a value of P19=0.1 (down from 0.3). The results are computed as the predicted %
savings in energy use and demand compared to the original building. For example, the %
savings in kWh have been calculated as:

$$\%\,\text{kWh} = 100 \times \frac{\text{Baseline annual kWh} - \text{Predicted annual kWh}}{\text{Baseline annual kWh}} \tag{11.8}$$

Table 11.6 assembles the results of all the runs for all four ECMs considered. The base
values of the parameters as well as their altered values are also indicated. For example,
ECM_A entails changing the baseline values of 0.65 and 0.30 for P17 and P19 to 0.30 and
0.10 respectively. The "correct" savings values of the energy channels are given in the
rows labeled "Exact". For example, the above ECM_A change would lower electricity use and
demand by 18.47% and 8.76% respectively. The same information is plotted in Fig. 11.7 for
easier visualization, where the correct values are shown as boxes without whiskers. One
notes that the calibration methodology seems to work satisfactorily for the most part
(with the target values contained within the spread of the whiskers of the inter-quartile
range of the top 20 calibrated simulation predictions) for kWh savings, though there are
deviations for ECM_C and ECM_D. However, there are important differences for monthly
demand (kW savings), especially for ECM_A and ECM_B. This is somewhat to be expected
considering that electric demand for each month is a one-time occurrence and is likely to
be harder to pin down accurately. Monthly electricity use, on the other hand, is a value
summed over all hours of each month and is likely to be more stable.

Table 11.6 Predictions of electricity use and monthly demand savings for synthetic building S2 by various calibration runs. The
"correct" savings and the savings predicted by the inter-quartile values of the Top 20 calibrated solutions are shown for the four
ECM measures simulated

| ECM | Parameters | Baseline values | New values | Run number (1) | Number of LHMC trials | Median GOF_Total | kWh savings: Median % | kWh savings: Stdev (2) | kW savings: Median % | kW savings: Stdev (2) |
|-----|------------|-----------------|------------|----------------|-----------------------|------------------|-----------------------|------------------------|----------------------|-----------------------|
| ECM_A | P17/P19 | 0.65/0.30 | 0.30/0.10 | Exact | - | - | 18.47 | 0.00 | 8.76 | 0.00 |
| | | | | 2a | 1,500 | 6.75 | 19.08 | 0.98 | 12.79 | 1.32 |
| | | | | 4a | 3,000 | 6.00 | 18.06 | 4.73 | 12.00 | 1.85 |
| | | | | 4b | 3,000 | 5.94 | 18.43 | 1.75 | 14.00 | 2.33 |
| | | | | 6a | 5,000 | 4.85 | 18.85 | 1.33 | 12.31 | 1.07 |
| ECM_B | P17/P19 | 0.65/0.30 | 0.20/0.25 | Exact | - | - | 19.69 | 0.00 | 6.06 | 0.00 |
| | | | | 2a | 1,500 | 6.75 | 20.05 | 0.68 | 11.35 | 1.04 |
| | | | | 4a | 3,000 | 6.00 | 19.32 | 4.22 | 9.68 | 2.22 |
| | | | | 4b | 3,000 | 5.94 | 20.58 | 2.00 | 11.26 | 2.24 |
| | | | | 6a | 5,000 | 4.85 | 20.12 | 1.24 | 9.98 | 1.24 |
| ECM_C | P11/P12 | 1.67/1.0 | 3.0/3.0 | Exact | - | - | -50.31 | 0.00 | -58.63 | 0.00 |
| | | | | 2a | 1,500 | 6.75 | -47.79 | 1.26 | -52.92 | 0.75 |
| | | | | 4a | 3,000 | 6.00 | -50.32 | 2.90 | -55.39 | 1.25 |
| | | | | 4b | 3,000 | 5.94 | -48.56 | 2.57 | -54.99 | 1.06 |
| | | | | 6a | 5,000 | 4.85 | -45.43 | 1.88 | -54.10 | 0.88 |
| ECM_D | P8/P9/P14 | 0.64/0.64/0.19 | 0.40/0.4/0.50 | Exact | - | - | -30.32 | 0.00 | -36.76 | 0.00 |
| | | | | 2a | 1,500 | 6.75 | -33.65 | 5.47 | -29.74 | 3.28 |
| | | | | 4a | 3,000 | 6.00 | -32.97 | 3.52 | -35.10 | 2.86 |
| | | | | 4b | 3,000 | 5.94 | -29.20 | 4.86 | -33.82 | 3.64 |
| | | | | 6a | 5,000 | 4.85 | -26.42 | 1.33 | -29.61 | 1.46 |

(1) a refers to one-stage and b refers to two-stage calibration
(2) Stdev: standard deviation of only those runs in the inter-quartile range (i.e., between 25 and 75%)
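
Equation 11.8 amounts to a one-line calculation, sketched below. The function name is hypothetical and the numbers in the usage note are made up purely for illustration; they are not taken from the case study.

```python
def percent_savings(baseline_annual, predicted_annual):
    """Percent savings relative to the baseline (Eq. 11.8).
    A negative result means the measure increases energy use or demand."""
    return 100.0 * (baseline_annual - predicted_annual) / baseline_annual
```

For instance, a hypothetical baseline of 1000 units and a predicted post-retrofit value of 800 units give 20% savings, while a predicted value of 1500 units gives -50%, the sign convention used for measures such as ECM_C that increase consumption.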

Generally, it would seem that the number of LHMC trials should be of the order of 3,000
per run for this building. Further, it is also noted that there is little benefit in
performing a two-stage calibration. Another observation from Fig. 11.7 is that, generally,
the uncertainty bands for the 5,000 trials/run cases are narrower than those for the 3,000
trials/run cases, but the median values are not less biased (in fact they seem to be worse
in a slight majority of cases). This trend suggests that the danger of biased predictions
does not diminish with a larger number of LHMC runs, and other solutions need to be
considered.

There were several ad hoc elements in the above calibration methodology. Another study by
Sun and Reddy (2006) attempted to frame the problem in a general analytic framework with a
firmer mathematical and statistical basis. For example, the parameters deemed influential
were selected by first normalizing them against the Euclidean norm, while an attempt was
made to infer the number of parameters which one could hope to tune with the data at hand
from the order of the highest-order definite submatrix (or rank of the Hessian matrix)
whose condition number is less than a predefined threshold value (refer back to Sect. 10.2
for an explanation of these terms). The overall conclusion was that trying to calibrate a
detailed simulation program with only utility bill information is never likely to be
satisfactory because of the numerical identifiability issue (described in Sect. 10.2.3).
One needs to either enrich the data set by non-intrusive sub-monitoring at, say, hourly
time scales for a few months if not the whole year, or reduce the number of strong
parameters to be tuned by performing controlled experiments when the building is
unoccupied and estimating their numerical values (the LHMC process could be used to
identify the strong parameters needing to be thus determined). This case study example was
meant to illustrate the general approach; this whole area of calibrating detailed building
energy simulation programs still requires further research before it can be used routinely
by the professional community.
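
The idea of inferring how many parameters the data can support from the condition number of Hessian submatrices can be illustrated with the rough sketch below. It is not the algorithm of Sun and Reddy (2006): the function name, the Gauss-Newton approximation of the Hessian by the product of the transposed sensitivity matrix with itself, the column scaling, the fixed parameter ordering, and the threshold value are all assumptions made for illustration.

```python
import numpy as np

def tunable_parameter_count(sensitivity, cond_threshold=1e6):
    """Order of the largest leading submatrix of the approximate Hessian
    whose condition number stays below cond_threshold.

    sensitivity: (n_observations, n_parameters) matrix of model sensitivities,
    with columns assumed to be ordered from most to least influential."""
    J = np.asarray(sensitivity, dtype=float)
    # Scale each column by its Euclidean norm so that units do not dominate
    norms = np.linalg.norm(J, axis=0)
    J = J / np.where(norms > 0.0, norms, 1.0)
    H = J.T @ J                                  # Gauss-Newton approximation (assumption)
    count = 0
    for k in range(1, H.shape[0] + 1):
        if np.linalg.cond(H[:k, :k]) < cond_threshold:
            count = k
        else:
            break                                # adding parameter k makes the problem ill-conditioned
    return count
```

When two parameters affect the observations in nearly the same way, their columns are almost collinear, the condition number rises sharply, and the count stops short of the full number of parameters, which is the identifiability problem referred to above.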



11.3 Model Selection and Identifiability

11.3.1 Basic Notions 

Model selection is the process of identifying the functional form or model structure,
while parameter estimation is the process of identifying the parameters in the model.
These two distinct but related issues are often jointly referred to as system
identification* (terms such as complete identification, structure recovery, and system
characterization are also used). Note that the use of the word estimation has a
connotation of uncertainty. This uncertainty in determining the parameters of the model
which describe the physical system is unavoidable, and arises from simplification errors
and noise invariably present both in the physical model and in the measurements. The
challenge is to decide on how best to simplify the model and design the associated
experimental protocol so as to minimize the uncertainty in the parameter estimates from
data corrupted by noise. Thus, such analyses not only require knowledge of the physical
behavior of the system, but also presume expertise in modeling, in designing and
performing experiments, and in regression/search methods. Many of the chapters in this
book (specifically Chaps. 1, 5, 6, 7, 9 and 10) directly pertain to such issues.

* Though there is a connotational difference between the words "identification" and
"estimation" in the English language, no such difference is usually made in the field of
inverse modeling. Estimation is a term widely used in statistical mathematics to denote a
similar effect as the term identification which appears in electrical engineering
literature.

System identification is formally defined as the determination of a mathematical model of
a system based on its input-output information over a relatively short period of time,
usually for either understanding the system or phenomenon being studied, or for making
predictions of system behavior. It deals with the choice of a specific model from a class
of models that are mathematically equivalent to a given physical system. Model selection
problems involve under-constrained problems with degrees of freedom greater than zero
where an infinite number of solutions are possible. One differentiates between (i)
situations when nothing is known about the system functional behavior, sometimes referred
to as complete identification problems, requiring the use of black-box models, and (ii)
partial identification problems wherein some insights are available and allow grey-box
models to be framed to analyze the data at hand. The objectives are to identify the most
plausible system models and the most likely parameters/properties of the system by
performing certain statistical analyses and experiments. The difficulty is that several
mathematical expressions may appear to explain the input-output relationships, partially
due to the presence of unavoidable errors or noise in the measurements and/or limitations
in the quantity and spread of data available. Usually, the data is such that it can only
support models with a limited number of parameters. Hence, by necessity (as against
choice), the mathematical inverse models are macroscopic in nature, and usually allow
determination of only certain essential properties of the system. Thus, an important and
key conceptual difference between forward (also called microscopic or microdynamic) models
and system identification inverse problems (also called macrodynamic) is that one should
realistically expect the latter to involve models containing only a few aggregate
interactions and parameters. Note that knowledge of the physical (or structural)
parameters is not absolutely necessary if internal prediction or a future (in time)
forecast is the only purpose. Such forecasts can be obtained through the reduced form
equations directly and exact deduction of parameters is not necessary (Pindyck and
Rubinfeld 1981).

Fig. 11.7 Electricity use (kWh) and monthly demand (kW) savings (in % of the baseline building values) for the four ECM measures
predicted by the top 20 calibrated solutions from different numbers of LHMC trials (see Table 11.6). The "correct" values are the
ones without whiskers

Classical inverse estimation methods can be enhanced by using the Bayesian estimation
method, which is more subtle in that it allows one to include prior subjective knowledge
about the value of the estimate or the probability of the unknown parameters in
conjunction with information provided by the sample data (see Sects. 2.5 and 4.6). The
prior information can be in the form of numerical results obtained from previous tests by
other researchers or even by the same researcher on identical equipment. Including prior
information allows better estimates. For larger samples, the Bayesian estimates and the
classical estimates are practically the same. It is when samples are small that Bayesian
estimation is particularly useful.
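
The statement that Bayesian and classical estimates converge for large samples can be illustrated with the simplest textbook case: estimating a mean when both the prior and the measurement noise are normal and the noise standard deviation is known. This is a minimal sketch under those assumptions; the function name and the numbers mentioned afterwards are hypothetical.

```python
import numpy as np

def posterior_mean_normal(sample, noise_sd, prior_mean, prior_sd):
    """Posterior mean and standard deviation of an unknown mean, assuming a
    normal prior and normally distributed measurements with known noise_sd."""
    sample = np.asarray(sample, dtype=float)
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = sample.size / noise_sd ** 2
    # Precision-weighted average of the prior mean and the sample mean
    post_mean = (prior_precision * prior_mean + data_precision * sample.mean()) / (
        prior_precision + data_precision)
    post_sd = (prior_precision + data_precision) ** -0.5
    return float(post_mean), float(post_sd)
```

With only two observations averaging 11, a noise standard deviation of 2, and a prior centered at 0 with standard deviation 2, the posterior mean is pulled well toward the prior (about 7.3). With a thousand such observations the posterior mean is within about 0.01 of the sample mean, i.e., practically the classical estimate.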

The following sub-sections will present the artificial neu- 
ral network approach to identifying black-box models and 
also discuss modeling issues and techniques relevant to grey- 
box models. 



11.3.2 Local Regression: LOWESS Smoothing Method

Sometimes the data is so noisy that the underlying trend may be obscured by the data
scatter. A non-parametric black-box method called the "locally weighted scatter plot
smoother" or LOWESS (Devore and Farnum 2005) can be used to smooth the data scatter.
Instead of using all the data (as is done in traditional ordinary least squares fitting),
the intent is to fit a series of lines (usually polynomial functions), each using a
prespecified portion of the data. Say one has n pairs of (x, y) observations, and one
elects to use subsets of 20% of the data at a time. For each individual point x_0, one
selects the 20% of the x-points closest to it, and fits a polynomial line with only this
subset of data. One then uses this model to predict the corresponding value of the
response variable y_0 at the individual point x_0. This process is repeated for each of
the n points so that one gets n sets of points (x_0, y_0). The LOWESS plot is simply the
plot of these n sets of data points. Figure 11.8 illustrates a case where the LOWESS plot
indicates a trend which one would be hard pressed to detect in the original data. Looking
at Fig. 11.8a, which is a scatter plot of the characteristics of bears, with x as the
chest girth of the bear and y its weight, one can faintly detect a non-linear behavior,
but it is not too clear. However, the same data when subjected to the LOWESS procedure
(Fig. 11.8b) assuming a pre-specified portion of 50% reveals a clear bi-linear behavior
with a steeper trend when x > 38. In conclusion, LOWESS is a powerful functional
estimation method which is local (as suggested by the name), non-parametric, and can
potentially capture any arbitrary feature present in the data.

Fig. 11.8 Example to illustrate the insight which can be provided by the LOWESS smoothing procedure. The data set is meant to detect
the underlying pattern between the weight of a wild bear and its chest girth. While the traditional manner of plotting the data
(frame a) suggests a linear function, the LOWESS plot (frame b), assuming a 50% prespecified smoothing length, reveals a bi-linear
behavior. (From Devore and Farnum 2005 by permission of Cengage Learning)
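
The LOWESS procedure described in this section can be sketched as follows. This is a simplified illustration, not a full implementation: the function name is hypothetical, a straight line is fitted locally, the tricube distance weighting is the customary LOWESS choice rather than something specified in the text, and the robustness iterations used by production implementations are omitted.

```python
import numpy as np

def lowess_sketch(x, y, frac=0.2):
    """Smooth y against x by fitting, at every point x0, a weighted straight
    line through the fraction `frac` of observations closest to x0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    k = max(3, int(np.ceil(frac * n)))           # number of neighbours per local fit
    y_smooth = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        idx = np.argsort(dist)[:k]               # the k closest x-points
        h = dist[idx].max()
        if h == 0.0:                             # all neighbours coincide with x0
            y_smooth[i] = y[idx].mean()
            continue
        w = (1.0 - (dist[idx] / h) ** 3) ** 3    # tricube weights (assumed choice)
        sw = np.sqrt(w)
        # Weighted least squares line centred at x0: the intercept is the prediction at x0
        A = np.column_stack([np.ones(k), x[idx] - x[i]]) * sw[:, None]
        beta, *_ = np.linalg.lstsq(A, y[idx] * sw, rcond=None)
        y_smooth[i] = beta[0]
    return y_smooth
```

Plotting the returned values against x gives the LOWESS curve. A larger `frac` (such as the 50% used for Fig. 11.8b) yields a smoother curve, while a smaller one follows local features more closely.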

11.3.3 Neural Networks: Multi-Layer Perceptron (MLP)

A widely used black-box approach is based on neural networks (NN), which are a class of
mathematical models based on the manner in which the human brain functions while
performing such activities as decision-making, pattern or speech recognition, image or
signal processing, system prediction, optimization and control (Wasserman 1989). NN grew
out of research in artificial intelligence, and hence they are often referred to as
artificial neural networks. NN possess several unique attributes that allow them to be
superior to the traditional methods of knowledge acquisition (of which data analysis and
modeling is one specific activity). They have the ability to: (i) exploit a large amount
of data/information, (ii) respond quickly to varying conditions, (iii) learn from
examples, and generalize underlying rules of system behavior, (iv) map complex nonlinear
behavior for which the input-output variable set is known but not their structural
interaction (viz., black-box models), and (v) handle noisy data, i.e., have good
robustness. The last 30 years have seen an explosion in the use of NN. These have been
successfully applied across numerous disciplines such as engineering, physics, finance,
psychology, information theory and medicine. For example, in engineering, NN have been
used for system modeling and control, short-term electric load forecasting, fault
detection and control of complex systems. Stock market prediction and classification in
terms of credit-worthiness of individuals applying for credit cards are examples of
financial applications. Medical applications of NN involve predicting the probable onset
of certain medical conditions based on a variety of health-related indices. In this book,
let us limit ourselves to their model building capability.

There are numerous NN topologies by which the relationship between one or more response
variable(s) and one or more regressor variable(s) can be framed (Wasserman 1989; Fausett
1993). A widely used architecture for applications involving predicting system behavior
is the feed-forward multi-layer perceptron (MLP). A typical MLP consists of an input
layer, one or more hidden layers and an output layer (see Fig. 11.9). The input layer is
made up of discrete nodes (or neurons, or units), each of which represents a single
regressor variable, while each node of the output layer represents a single response
variable. Only one output node is shown, but the extension to a set of response variables
is obvious. Networks can consist of various topologies; for example, with no hidden
layers (though this is uncommon), with one hidden layer, and with numerous hidden layers.
There is wide consensus that, except in rare circumstances, an MLP architecture with only
one hidden layer is adequate for most applications. While the number of nodes in the
input and output layers is dictated by the specific problem at hand, the number of nodes
in each of the hidden layers is a design choice (certain heuristics are described further
below).

Fig. 11.9 Feed-forward multilayer perceptron (MLP) topology for a network with 3 nodes in the input layer, 3 in the hidden layer and
one in the output layer, denoted by MLP(3,3,1). The weights for each of the interactions between the input and hidden layer nodes are
denoted by (w_11, ..., w_33) while those between the hidden and the output nodes are denoted by (v_10, v_20, v_30). Extending the
architecture to deal with more nodes in any layer and with more hidden layers is intuitively straightforward. The square nodes
indicate those where some sort of processing is done, as elaborated in Fig. 11.10

Each node of the input layer is connected by lines to the nodes of the first hidden layer
to indicate information flow, and so on for each successive layer till the output layer
is reached. This is why such a representation is called "feed-forward". The input nodes,
or the X vector, are multiplied by an associative weight vector W representative of the
strength of the specific connection. These are summed so that (see Figs. 11.9 and 11.10):

$$\mathrm{NET}_1 = (w_{11}x_1 + w_{21}x_2 + w_{31}x_3) = \mathbf{X}\mathbf{W} \tag{11.9}$$

For each node i, the NET signal is then processed further by 
an activation function: 

$$\mathrm{OUT}_i = f(\mathrm{NET}_i) + b_i \tag{11.10}$$

Fig. 11.10 The three computational steps done at processing node H_1, represented by square nodes in Fig. 11.9. The incoming signals
are weighted and summed to yield the NET, which is then transformed non-linearly by a squashing (or basis) function to which a bias
term is added. The resulting OUT signal becomes the input to the next node downstream to which it is connected, and so on. These
steps are done at each of the processing nodes

The activation function f(NET) (also called basis function or squashing function) can be
a linear mapping function (this would lead to the simple linear regression model), but
more generally, the following monotone sigmoid forms are adopted:

$$f(\mathrm{NET}_i) = \left[1 + \exp(-\mathrm{NET}_i)\right]^{-1} \quad \text{or} \quad f(\mathrm{NET}_i) = \tanh(\mathrm{NET}_i) \tag{11.11}$$

Logistic or hyperbolic tangent functions are selected because of their ability to squash
or limit the values of the output within certain limits, such as (0, 1) for the logistic
function and (-1, 1) for the hyperbolic tangent. It is this mapping which allows
non-linearity to be introduced in the model structure. The bias term b is called the
activation threshold for the corresponding node and is introduced to avoid the activation
function getting stuck in the saturated or limiting tails of the function. As shown in
Fig. 11.9, this process is continued till an estimate of the output variable y is found.
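
The forward pass through an MLP(3,3,1) network can be sketched as below. The function and argument names are hypothetical. The bias is added after the squashing function, following the wording of the caption of Fig. 11.10; this placement is a reading of that caption, and many software libraries instead add the bias to NET before applying the activation function.

```python
import numpy as np

def logistic(net):
    """Logistic squashing function of Eq. 11.11, with output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-net))

def mlp_forward(x, W, b_hidden, v, b_out, activation=np.tanh):
    """Forward pass of a one-hidden-layer perceptron with a single output node.

    x: inputs, shape (3,); W: input-to-hidden weights, shape (3, 3), where
    column i holds the weights feeding hidden node i; v: hidden-to-output
    weights, shape (3,); b_hidden, b_out: bias terms."""
    net_hidden = x @ W                           # weighted sum at each hidden node (Eq. 11.9)
    out_hidden = activation(net_hidden) + b_hidden
    net_out = out_hidden @ v                     # weighted sum at the output node
    return activation(net_out) + b_out
```

Passing `activation=logistic` switches to the first form of Eq. 11.11. With all weights and biases set to zero, every NET is zero, so the logistic network returns 0.5 and the tanh network returns 0.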

The weight structure determines the total network behavior. Training the MLP network is
done by adjusting the network weights in an orderly manner such that each iteration
(referred to as an "epoch" in NN terminology) results in a step closer to the final
value. Numerous epochs (of the order of 100,000 or more) are typically needed, a simple
and quick task for modern personal computers. The gradual convergence is very similar to
the gradient descent method used in nonlinear optimization or when estimating parameters
of a non-linear model. The loss function is usually the squared error of the model
residuals, just as in OLS. Adjusting the weights so as to minimize this error function is
called training the network (some use the terminology "learning by the network"). The
most commonly used algorithm to perform this task is the back-propagation training
algorithm, where partial derivatives (reflective of the sensitivity coefficients) of the
error surface are used to determine the new search direction. The step size is determined
by a user-specified learning rate selected so as to hasten the search but not lead to
over-shooting and instability (notice the similarity of the entire process with the
traditional non-linear search methodology described in Sect. 7.4). As with any model
identification where one has the possibility of adding a large number of model
parameters, there is the distinct possibility of over-fitting or over-training, i.e.,
fitting a structure to the noise in the data. Hence, it is essential that a sample
cross-validation scheme be adopted such as that used in traditional regression model
identification (see Sect. 5.3.2). In fact, during MLP modeling, the recommended approach
is to sub-divide the data set used to train the MLP into three subsets:
(i) the training data set, used to evaluate different MLP architectures and to train the
weights of the nodes in the hidden layer;
(ii) the validation data set, meant to monitor the performance of the MLP during
training. If the network is allowed to train too long, it tends to over-train, leading to
a loss in generality of the model;
(iii) the testing data set, similar to the cross-validation data set, meant to evaluate
the predictive or external accuracy of the MLP network using such indices as the CV and
NMBE. These are often referred to as generalization errors.

Fig. 11.11 Conceptual figure illustrating how to select the optimal MLP network weights based on the RMSE errors from the training
and validation data sets

The usefulness of the validation data set during training 
of a specific network is illustrated in Fig. 11.11. During the 
early stages of training, the RMSE of both the training and 
validation data sets drop at the same rate. The RMSE for 
the training data set keeps on decreasing as more iterations 
(or epochs) are performed. At each epoch, the trained model 
is applied to the validation data set, and the corresponding 
RMSE error computed. The validation error eventually starts 
to rise; it is at this point that training ought to be stopped. The 
node weights at this point are the ones which correspond to 
the optimal model. Note, however, that this process is specific
to a preselected architecture, and so different architectures
need to be tried in a similar manner. Compare this process
with the traditional regression model building approach
wherein the optimal solution is given by closed form solutions
and no search or training is needed. However, the need
to discriminate between competing model functions still exists
and statistical indices such as RMSE or R² and adjusted
R² are used for this purpose (see Sect. 5.3.2).
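
The stopping rule described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `select_stopping_epoch` and the `patience` parameter are introduced here for the example, and the RMSE histories are made-up numbers, not results from any study.

```python
def select_stopping_epoch(val_rmse, patience=3):
    """Return the index of the epoch whose weights should be retained.

    The scan stops once the validation RMSE has failed to improve for
    `patience` consecutive epochs (a hypothetical tuning parameter); the
    epoch with the lowest validation RMSE seen so far is returned.
    """
    best_epoch = 0
    best_rmse = val_rmse[0]
    for epoch, rmse in enumerate(val_rmse):
        if rmse < best_rmse:
            best_epoch, best_rmse = epoch, rmse
        elif epoch - best_epoch >= patience:
            break  # validation error has been rising: stop training here
    return best_epoch


# Illustrative (made-up) histories: the training RMSE keeps decreasing,
# while the validation RMSE reaches a minimum and then rises again.
train_rmse = [9.0, 6.5, 5.0, 4.1, 3.5, 3.1, 2.8, 2.6, 2.5]
val_rmse = [9.2, 6.9, 5.6, 5.0, 4.8, 4.9, 5.2, 5.6, 6.1]

best = select_stopping_epoch(val_rmse)
print("keep weights from epoch", best, "with validation RMSE", val_rmse[best])
```

The same scan would have to be repeated for each candidate architecture, since the selected epoch is specific to the network being trained.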

MLP networks have also been used to model and forecast 
dynamic system behavior as well as time series data over a 
time horizon. There are two widely used architectures. The 
time series feed forward network shown in Fig. 11.12a is 
easier and more straightforward to understand and imple- 
ment, and so is widely used. A recurrent network is one 
where the outputs of the nodes in the hidden layer are fed 
back as inputs to previous layers. This allows a higher level 
of non-linearity to be mapped, but they are more generally
difficult to train properly. Many types of interconnections are
possible with the more straightforward network arrangement
for one hidden layer shown in Fig. 11.12b. The networks
assumed in these figures use the past values of the variable
itself; extensions to the multivariate case would involve
investigating different combinations of present and lagged values
of the regressor variables as well.

Fig. 11.12 Two different MLP model architectures appropriate
for time series modeling. The simpler feed forward network
uses time-lagged values till time t to predict a future value at time
(t+1). Recurrent networks use internally generated past data
with only the current value for prediction, and are said to be
generally more powerful while, however, needing more expertise in
their proper use. a Feedforward network, b Recurrent network.
(From SPSS 1997)
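
The input arrangement of the feed-forward time series network can be illustrated by building the lagged samples explicitly. The sketch below is an assumption-laden illustration: the helper name `make_lagged_samples` and the short series are hypothetical, and only the univariate case discussed in the text is shown.

```python
def make_lagged_samples(series, n_lags):
    """Build (inputs, targets) pairs for a feed-forward time series model.

    Each input row holds the n_lags most recent values up to time t
    (oldest first); the matching target is the value at time t+1.
    """
    inputs, targets = [], []
    for t in range(n_lags - 1, len(series) - 1):
        inputs.append(series[t - n_lags + 1 : t + 1])
        targets.append(series[t + 1])
    return inputs, targets


# Hypothetical hourly observations of a single variable
series = [10, 12, 15, 14, 13, 16]
X, y = make_lagged_samples(series, n_lags=3)
for row, target in zip(X, y):
    print(row, "->", target)
```

For the multivariate case, each row would be extended with present and lagged values of the regressor variables in the same manner.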

Some useful heuristics with MLP training are assembled 
below (note that some of these closely parallel those fol- 
lowed by traditional model fitting): 
(i) in most cases, one hidden layer should be adequate; 
(ii) start with a small set of regressor variables deemed most
influential, and gradually include additional variables
only if the magnitude of the model residuals decreases
and if this leads to better behavior of the residuals. This
is illustrated in Fig. 11.13 where the use of one regressor
results in poorly behaved model residuals, which are greatly
reduced as two more relevant regressor variables are
introduced;
(iii) the number of training data points should not be less 
than about 10 times the total number of weights to be 
tuned. Unfortunately, this criterion is not met in some 
published studies; 
(iv) the number of nodes in the hidden layer should be
directly related to the complexity/non-linearity of the
problem. Often, this number should be in the range of 1-3
times the number of regressor variables;
(v) the original data set should be split such that training 
uses about 50-60% of the number of observations, with 
the validation and the testing using about 20-25% each. 
All three subsets should be chosen randomly; 
(vi) as with non-linear parameter estimation, training should
be repeated with different plausible architectures, and
further, each of the architectures should be trained using
different mixes of training/validation/testing subsets so
as to avoid the pitfalls of local minima. Commercial
software is available which automates this process, i.e.,
trains a number of different architectures and lets the
analyst select the one deemed most appropriate.
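
Heuristics (iii) and (v) lend themselves to a quick numerical check before any training is attempted. The sketch below is illustrative only: the function names are hypothetical, the 60/20/20 split is one choice within the ranges quoted above, and counting one bias weight per hidden and output node is an assumption (the text does not say whether biases are included in the weight count).

```python
import random


def count_mlp_weights(n_inputs, n_hidden, n_outputs=1, bias=True):
    """Number of tunable weights in a one-hidden-layer MLP.

    Assumption: with bias=True each hidden and output node carries one
    extra bias weight.
    """
    extra = 1 if bias else 0
    return (n_inputs + extra) * n_hidden + (n_hidden + extra) * n_outputs


def enough_training_data(n_train, n_weights, factor=10):
    """Heuristic (iii): training points should be about 10x the weights."""
    return n_train >= factor * n_weights


def random_split(n_obs, train_frac=0.6, val_frac=0.2, seed=0):
    """Heuristic (v): random training/validation/testing index subsets."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    n_train = int(round(train_frac * n_obs))
    n_val = int(round(val_frac * n_obs))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


n_weights = count_mlp_weights(3, 10, 1)   # an MLP(3-10-1) network
train, val, test = random_split(1000)
print(n_weights, len(train), len(val), len(test))
print("10x rule satisfied:", enough_training_data(len(train), n_weights))
```

Repeating `random_split` with different seeds gives the different mixes of subsets called for in heuristic (vi).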
Two examples of MLP modeling from the published 
building energy literature dealing with predicting building 
response over time are discussed below. Kawashima et al. 
(1998) evaluated several time series modeling methods for 
predicting the hourly thermal load of a building over a 24 h 
time horizon using current and lagged values of outdoor tem- 
perature and solar insolation. Comparative results in terms of 
the CV and NMBE statistics during several days in summer 
and in winter are assembled in Table 11.7 for five methods
(out of a total of seven methods in the original study), all of
which used 15 regressor variables. It is quite clear that the
MLP models are vastly superior in predictive accuracy to the
traditional methods such as ARIMA, EWMA and linear
regression (the last used 15 regressor variables). The recurrent
model is slightly superior to the feed-forward MLP model.
However, the MLP models, both of which have 15 nodes in
the input layer and 31 nodes in the single hidden layer, imply
that a rather complex model was needed to adequately
capture the short-term forecasts.

Let us briefly summarize another study (Miller and Seem 
1991) in order to point out that many instances only warrant 
simple MLP architectures, and that, even then, traditional 
methods may be more appropriate in practice. The study 
compared MLP models with traditional methods (namely, 
the recursive least squares) meant to predict the amount of 
time needed for a room to return to its desired temperature 
after night or weekend thermostat set-back. Only two input 
variables were used: the room temperature and the outdoor 
temperature. It was found that the best MLP was one with
one hidden layer with two nodes even though evaluations
were done with two hidden layers and up to 24 hidden nodes.
The general conclusion was that even though the RMS errors
and the maximum error of the MLP were slightly lower, the
improvement was not significant enough to justify the more
complex and expensive implementation cost of the MLP
algorithm in actual HVAC controller hardware as compared to
more traditional approaches.^

Fig. 11.13 Scatter plots of two different MLP architectures fit to the
same data set to show the importance of including the proper input
variables. Cooling loads of a building are being predicted against outdoor
dry-bulb and humidity and internal loads as the three regressor
variables. Both the magnitude and the non-uniform behavior of the model
residuals are greatly reduced as a result. a Residuals of MLP(1-10-1)
with outdoor temperature as the only regressor. b Residual plot, c
Measured vs modeled plot of MLP(3-10-1) with three regressors

Table 11.7 Accuracy of different modeling approaches in predicting
hourly thermal load of a building over a 24 h time horizon. (Extracted
from Kawashima et al. 1998)

Model type                                  Coefficient of      Normalized mean
                                            variation (%)       biased error (%)
                                            Winter   Summer     Winter   Summer
Auto regressive integrated moving
average (ARIMA)                              27.7     34.4        1.6      1.0
Exponential weighted moving
average (EWMA)                               12.0     26.0        1.8      3.3
Linear regression with 15 regressors         27.8     21.4       -2.0     -1.3
Multi-layer Perceptron, MLP(15,31,1)         11.2      9.1       -0.5     -0.2
Recurrent Multi-layer Perceptron,
MLP(15,31,1)                                  9.3      6.8       -0.5     -0.4

For someone with a more traditional background and out- 
look to model building, a sense of unease is felt when first 
exposed to MLP. Not only is a clear structure or functional 
form lacking even after the topography and weights are de- 
termined, but the "model" identification is also somewhat of 
an art. Further, there is the unfortunate tendency among several
analysts to apply MLP to problems which can be solved
more easily by traditional methods, which offer more
transparency and allow clearer physical interpretation of the model
structure and of its parameters. MLP should be viewed as
another tool in the arsenal of the data analyst and, despite 
its power and versatility, not as the sole one. As with any 
new approach, repeated use and careful analysis of MLP re- 
sults will gradually lead to the analyst gaining familiarity, 
discrimination of when to use it, and increased confidence 
in its proper use. The interested reader can refer to several 
excellent textbooks of varied level of theoretical complexity; 
for example, Wasserman (1989); Fausett (1993), and Hay- 
kin (1999). The MLP architecture is said to be "inspired" by 
how the human brain functions. This is quite a stretch, and 
at best a pale replication, considering that the brain typically 
has about 10 billion neurons with each neuron having several 
thousands of interconnections!



^ This is a good example of the quote by Einstein expressing the view
that ought to be followed by all good analysts: "Everything should be
as simple as possible, but not simpler".

11.3.4 Grey-Box Models and Policy Issues Concerning Dose-Response Behavior

Example 1.3.2 in Chap. 1 discussed three methods of
extrapolating dose-response curves down to low doses using
observed laboratory tests performed at high doses. While the
three types of models agree at high doses, they deviate
substantially at low doses because the models are functionally
different (Fig. 1.16). Further, such tests are done on laboratory
animals, and how well they reflect actual human response
is also suspect. In such cases, model selection is based more
on policy decisions than on how well a model fits the
data. This aspect is illustrated below using grey-box models
based on simplified but phenomenological assumptions of
how biological cells become cancerous.

This section will discuss the application of inverse models to
modeling risk to humans when exposed to toxins. Toxins are
biological poisons usually produced by bacteria or fungi under
adverse conditions such as shortage of nutrients, water or space.
They are, generally, extremely deadly even in small doses. Dose
is the variable describing the total mass of toxin which the
human body ingests (either by inhalation or by food/water
intake) and is a function of the toxin concentration and duration
of exposure (some models are based on the rate of ingestion, not
simply the dose amount). Response is the measurable physiological
change in the body produced by the toxin which has many
manifestations, but here the focus will be on human cells
becoming cancerous. Since different humans (and test animals)
react differently to the same dose, the response is often
interpreted as a probability of cancer being induced, which
can be interpreted as a risk. Responses may have either no
or small threshold values to the injected dose coupled with
linear or non-linear behavior (see Fig. 1.16).

Dose-response curves passing through the origin are
considered to apply to carcinogens, and models have been
suggested to describe their behavior. The risk or probability
of infection to a toxic agent with time-variant concentration
C(t) from times $t_1$ to $t_2$ is provided by Haber's law:

$$R(C,t) = k \int_{t_1}^{t_2} C(t)\,dt \tag{11.12a}$$

where k is a proportionality constant of the specific toxin and
is representative of the slope of the curve.

The non-linear dose-response behavior is modeled using 
the toxic load equation: 



$$R(C,t) = k \int_{t_1}^{t_2} C^{n}(t)\,dt \tag{11.12b}$$



where n is the toxic load exponent and depends on the toxin. 
The value of n generally varies between 2.00 and 2.75. The 
implication of n = 2 is that if a given concentration is doubled 
with the exposure time remaining unaltered, the response in- 
creases fourfold (and not by twice as predicted by a linear 
model). 
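
The fourfold effect can be checked numerically. The sketch below integrates Eqs. 11.12a and 11.12b by the trapezoidal rule; the function name `toxic_load`, the value k = 1 and the concentration history are hypothetical choices made only for illustration.

```python
def toxic_load(conc, times, n=1.0, k=1.0):
    """Trapezoidal estimate of k * integral of C(t)**n dt.

    n = 1 corresponds to Haber's law (Eq. 11.12a); n > 1 corresponds to
    the toxic load equation (Eq. 11.12b).
    """
    total = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        total += 0.5 * (conc[i] ** n + conc[i - 1] ** n) * dt
    return k * total


# Hypothetical concentration history sampled at unit time steps
times = [0.0, 1.0, 2.0, 3.0, 4.0]
conc = [2.0, 3.0, 5.0, 3.0, 2.0]
doubled = [2.0 * c for c in conc]   # same exposure time, twice the level

ratio_linear = toxic_load(doubled, times, n=1) / toxic_load(conc, times, n=1)
ratio_n2 = toxic_load(doubled, times, n=2) / toxic_load(conc, times, n=2)
print("response ratio with n = 1:", ratio_linear)
print("response ratio with n = 2:", ratio_n2)
```

In general, doubling the concentration multiplies the response by 2 raised to the power n, so the upper value n = 2.75 quoted above would give a factor of about 6.7.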

The above models are somewhat empirical (or black- 
box) and are useful as performance models. However, they 
provide little understanding or insights of the basic process 
itself. Grey-box models based on simplified but phenom- 
enological considerations of how biological cells become 
cancerous have also been proposed. Though other model 
types have also been suggested (such as probit and logistic 
models discussed in Sect. 10.4.4), the Poisson distribution 
(Sect. 2.4.2f) is appropriate since it describes the number of 
occurrences of isolated independent events when the prob- 
ability of a single outcome occurring over a very short time 
period is proportional to the length of the time interval. The 
process by which a tumor spreads in a body has been mod- 
eled by multi-stage multi-hit models of which the simpler 
ones are shown in Fig. 11.14. The probability of getting a 
hit (i.e., a cancerous cell coming in contact with a normal 
cell) in an n-hit model is proportional to the dose level and
the number n of hits necessary to cause the onset of cancer.
Hence, in a one-hit model, one hit is enough to cause a toxic
response in the cell with a known probability; in a two-hit
model, the probabilities are treated as random independent
events. Thus, the probabilities of the response cumulate as
independent random events. The two-stage model looks
superficially similar to the two-hit model, but is based on a
distinctly different phenomenological premise. The process
is treated as one where the cell goes through a number of
distinct stages with each stage leading it gradually towards
becoming cancerous by disabling certain specific functions
of the cell (such as tumor suppression capability).
This results in the dose effects of each successive hit
cumulating non-linearly and exhibiting an upward-curving
function. Thus, the multistage process is modeled as one
where the accumulation of dose is not linear but includes
historic information of past hits in a non-linear manner (see
Fig. 11.15).

Fig. 11.14 Grey-box models for dose-response are based on different
phenomenological presumptions of how human cells react to exposure.
[Figure panels: One-hit, Two-hit, Two-stage, and Three-stage w/ multihit
models, showing carcinogens acting on normal, intermediate and
malignant cells.] (From Kammen and Hassenzahl 1999 by permission of
Princeton University Press)

Fig. 11.15 Two very different models of dose-response behavior with
different phenomenological basis. The range of interest is well below
the range where data actually exist. (From Crump 1984)

Following Masters and Ela (2008), the one-hit model
expresses the probability of cancer onset P(d) as:

$$P(d) = 1 - \exp[-(q_0 + q_1 d)] \tag{11.13}$$

where d is the dose, and $q_0$ and $q_1$ are empirical best fit
parameters. However, cancer can be induced by other causes
as well (called "background" causes). Let P(0) be the
background rate of cancer incidence corresponding to d = 0.
Then, since $\exp(x) \approx 1 + x$ for small x, P(0) is found
from Eq. 11.13 to be:

$$P(0) \approx 1 - [1 - q_0] = q_0 \tag{11.14}$$

Thus, model coefficient $q_0$ can be interpreted as the
background risk. Hence, the lifetime probability of getting cancer
from small doses is a linear model given by:

$$P(d) \approx 1 - [1 - (q_0 + q_1 d)] = q_0 + q_1 d = P(0) + q_1 d \tag{11.15}$$

In a similar fashion, the multi-stage model of order m takes
the form:

$$P(d) = 1 - \exp[-(q_0 + q_1 d + q_2 d^2 + \dots + q_m d^m)] \tag{11.16}$$



Thus, though the model structure looks empirical superficially,
grey box models allow some interpretation of the model
coefficients. Note that the one-hit and the multi-stage models
are linear only in the low dose region (which is the region to
which most humans will be exposed but where no data is
available) but exponential over the entire range. Figure 11.15
illustrates how these models (Eqs. 11.13 and 11.16) capture 
measured data, and more importantly that there are several 
orders of magnitude differences in these models when ap- 
plied to the low dosage region. Obviously the multi-stage 
model seems more accurate than the one-hit model as far as 
data fitting is concerned, and one's preference would be to 
select the former. However, there is some measure of uncer- 
tainty in these models when extrapolated downwards, and 
further there is no scientific evidence which indicates that 
one is better than the other in capturing the basic process. 
The U.S. Environmental Protection Agency has deliberately 
chosen to select the one-hit model since it is much more con- 
servative (i.e., predicts higher risk for the same dose value) 
than the other at lower doses. This was a deliberate choice in 
view of the lack of scientific evidence in favor of one over 
the other model. 
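
The statement that these models are linear only in the low dose region can be verified numerically. In the sketch below the coefficient values are hypothetical (they are not fitted to any data set), and the function names are introduced only for this illustration of Eqs. 11.13, 11.15 and 11.16.

```python
import math


def one_hit(d, q0, q1):
    """One-hit model, Eq. 11.13."""
    return 1.0 - math.exp(-(q0 + q1 * d))


def one_hit_linear(d, q0, q1):
    """Low-dose linear approximation, Eq. 11.15."""
    return q0 + q1 * d


def multistage(d, q):
    """Multi-stage model of order m = len(q) - 1, Eq. 11.16."""
    exponent = sum(qi * d ** i for i, qi in enumerate(q))
    return 1.0 - math.exp(-exponent)


# Hypothetical coefficients chosen only to illustrate the behavior
q0, q1 = 1e-4, 2e-3

for d in (0.01, 0.1, 1.0, 100.0, 1000.0):
    exact = one_hit(d, q0, q1)
    approx = one_hit_linear(d, q0, q1)
    print(f"d = {d:8.2f}  exact = {exact:.6e}  linear = {approx:.6e}")
```

At the smallest doses the two columns agree closely, while at the largest dose the linear form exceeds 1 and is no longer a valid probability. Calling `multistage(d, [q0, q1])` with only two coefficients reproduces the one-hit model.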

This example discussed one type of problem where in- 
verse methods are used for decision making. The application 
of black-box and grey-box models to the same process was also
illustrated. Instances when the scientific basis is poor and
when one has to extrapolate the model well beyond the range
over which it was developed qualify as one type of ill-defined
problem. The mathematical form of the dose-response
curve selected can provide widely different estimates of the 
risk at low doses. The lack of a scientific basis and the need 
to be conservative in drafting associated policy measures led 
to the selection of a model which is probably less accurate in 
how it fits the data collected but is, probably, preferable for 
its final intended purpose. 



11.3.5 State Variable Representation and Compartmental Models

A classification of various dynamic modeling approaches is 
shown in Fig. 11.16. There is an enormous amount of knowledge
in this field and on relevant inverse methods, and only
a basic introduction to a narrow class of models is provided
here. As described in Sect. 1.2.4, one differentiates between
distributed parameter and lumped parameter system models,
which can be analyzed either in time domain or frequency
domain. The models, in turn, can be divided into linear and
non-linear, and then into time continuous or discrete time
models. This book limits itself to ARMAX or transfer function
models described in Sect. 9.6.2 which are discrete time
linear models with constant weight coefficients, and to the
state variable formulation (described below).

Fig. 11.16 Classification of approaches to analyze dynamic models

Dynamic models are characterized by differential equa- 
tions of higher orders which are difficult to solve. Linear dif- 
ferential equations of order n can be represented by a set of n 
simultaneous first order differential equations. The standard 
nomenclature adopted to represent such systems is shown in 
Fig. 11.17. The system is acted upon by a vector of external 
variables or signals or influences u while y is the output or 
response vector (the system could have several responses). 
The vector x characterizes the state of the internal variables
or elements of the system, which need not be outputs and
may not even be measurable. For example, in mechanical systems,
these internal elements may be positions and velocities of 
separate components of the system, or in thermal RC net- 
works representing a wall, these could be the temperatures of 
the internal nodes within the wall. In many applications, the 
variables x may not have direct physical significance, nor are 
they necessarily unique. 

The state variable representation^ of a multi-input multi- 
output linear system model is framed as: 



$$\begin{aligned} \dot{x} &= A x + B u \\ y &= C x + d u \end{aligned} \tag{11.17}$$



where A is called the state matrix, B the input matrix, C the
output matrix and d the direct transmission term (Franklin et
al. 1994).

Fig. 11.17 Nomenclature adopted in modeling dynamic systems by the
state variable representation [block diagram: inputs u act on a system
described by $\dot{x} = f(x,u)$ and $y = g(x,u)$]

^ Some texts refer to this form as the "state-space" model. Control
engineers retain the distinction in both terms, and refer to state space as
a specific type of control design technique which is based on the state
variable model formulation.

Note that the first function relates the state vector
at current time with those of its previous values. This can be
expanded into a linear function of p states and m inputs:

$$\dot{x}_i = a_{i1} x_1 + a_{i2} x_2 + \dots + a_{ip} x_p + b_{i1} u_1 + b_{i2} u_2 + \dots + b_{im} u_m \tag{11.18a}$$



Note the similarity between this formulation and that of the 
ARMAX model given by Eq. 9.43. Finally, the outputs them- 
selves may or may not be the state variables. Hence, a more 
general representation is to express them as linear algebraic 
combinations: 



$$y_i = c_{i1} x_1 + c_{i2} x_2 + \dots + c_{ip} x_p + d_{i1} u_1 + d_{i2} u_2 + \dots + d_{im} u_m \tag{11.18b}$$
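
As a concrete illustration of the state variable form, consider a second-order mechanical system rewritten as two first-order equations with position and velocity as the states. Everything in the sketch below is an assumption made for the example: the mass-spring-damper system, its parameter values, and the helper names are not taken from the text.

```python
def mat_vec(M, v):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def vec_add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]


def state_derivative(A, B, x, u):
    """First line of the state variable model: dx/dt = A x + B u."""
    return vec_add(mat_vec(A, x), mat_vec(B, u))


def output(C, D, x, u):
    """Second line of the state variable model: y = C x + D u."""
    return vec_add(mat_vec(C, x), mat_vec(D, u))


# Hypothetical system: m*z'' + c*z' + k*z = u, with states x1 = z
# (position) and x2 = z' (velocity); the measured output is the position.
m, c, k = 2.0, 0.5, 8.0
A = [[0.0, 1.0],
     [-k / m, -c / m]]      # state matrix
B = [[0.0],
     [1.0 / m]]             # input matrix
C = [[1.0, 0.0]]            # output matrix
D = [[0.0]]                 # direct transmission term

x = [0.1, 0.0]              # current state
u = [4.0]                   # current input (applied force)
xdot = state_derivative(A, B, x, u)
y = output(C, D, x, u)
print("dx/dt =", xdot, " y =", y)
```

Each row of `A` and `B` corresponds to one instance of Eq. 11.18a, and each row of `C` and `D` to one instance of Eq. 11.18b.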



Compartmental models are a sub-category of the state
variable representation appropriate when a complex process
can be broken up into simpler discrete sub-systems, each of
which can be viewed as homogeneous and well-mixed, and which
exchange mass with each other and/or a sink/environment.
This is a form of discrete linear lumped parameter modeling
approach more appropriate for time invariant model parameters
which is described in several textbooks (for example,
Godfrey 1983). Further, several assumptions are inherent
in this approach: (i) the materials or energy within a
compartment get instantly fully mixed and homogeneous, (ii)
the exchange rates among compartments are related to the
concentrations or densities of these compartments, (iii) the
volumes of the compartments are taken to be constant over 
time, and (iv) usually no chemical reaction is involved as the 
materials flow from one cell to another. The quantity or con- 
centration of material in each compartment can be described 
by first-order (linear or non-linear) constrained differential 
equations, the constraints being that physical quantities such 
as flow rates be non-negative. This type of model has been 
extensively used in such diverse areas as medicine (biomedi- 
cine, pharmacokinetics), science and engineering (ecology, 
environmental engineering, and indoor air quality), and even 
in social sciences. For instance, in a pharmacokinetic model, 
the compartments may represent different sections of a body 
within which the concentration of a drug is assumed to be 
uniform. 

Though these models can be analyzed with time-variant
coefficients, time invariance is usually assumed. Compartmental
models are not appropriate for certain engineering
applications such as closed-loop control, or even for conservation
of momentum equations, which are non-compartmental in
nature. Two or more dependent variables, each a function
of a single independent variable (usually time for dynamic
modeling of lumped physical systems) appear in such problems
which lead to a system of ODEs.

Fig. 11.18 A system of three interconnected radial rooms in which an
abrupt contamination release has occurred. A quantity of outdoor air
r is supplied to Room 1. The temporal variation in the concentration
levels in the three rooms can be conveniently modeled following the
compartmental modeling approach [diagram labels: Outdoors, Room 1,
Room 2, Room 3]

Consider the three radial-room building as shown in
Fig. 11.18 with volumes $V_1$, $V_2$ and $V_3$. A certain amount of
uncontaminated outdoor air (of flow rate r) is brought into
Room 1 from where it flows outwards to the other rooms as
shown. Assume that an amount of contaminant is injected in
the first room as a single burst which mixes uniformly with
the air in the first room. The contaminated air then flows to
the second room, mixes uniformly with the air in the second
room, and on to the third room from where it escapes to the
sink (or the outdoors). Let $x_1(t), x_2(t), x_3(t)$ be the volumetric
concentrations of the contaminant in the three rooms, and
let $k_i = r/V_i$. The entire system is modeled by a set of three
ordinary differential equations (ODEs) as follows:

$$\begin{aligned} \dot{x}_1 &= -k_1 x_1 \\ \dot{x}_2 &= k_1 x_1 - k_2 x_2 \\ \dot{x}_3 &= k_2 x_2 - k_3 x_3 \end{aligned} \tag{11.19}$$



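In matrix form, the above set of ODEs can be written as:

$$\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \\ \dot{x}_3 \end{bmatrix} = \begin{bmatrix} -k_1 & 0 & 0 \\ k_1 & -k_2 & 0 \\ 0 & k_2 & -k_3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \tag{11.20a}$$

or

$$\dot{x} = A x \tag{11.20b}$$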




The eigenvalue method of solving equations of this
type consists in finding values of a scalar, called the
eigenvalue $\lambda$, which satisfies the equation

$$(A - \lambda I)\,x = 0 \tag{11.21a}$$

where I is the identity matrix (Edwards and Penney
1996). For the three-room example, the expanded form of
Eq. 11.21a is:

$$\begin{bmatrix} -k_1 - \lambda & 0 & 0 \\ k_1 & -k_2 - \lambda & 0 \\ 0 & k_2 & -k_3 - \lambda \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \tag{11.21b}$$

The eigenvalues are then determined from the following
equation:

$$|A - \lambda I| = 0 \quad \text{or} \quad (-k_1 - \lambda)(-k_2 - \lambda)(-k_3 - \lambda) = 0 \tag{11.22}$$

The three distinct eigenvalues are the roots of Eq. 11.22;
namely: $\lambda_1 = -k_1$, $\lambda_2 = -k_2$, $\lambda_3 = -k_3$. With each
eigenvalue is associated an eigenvector v from where the
general solution can be determined. For example, consider
the case when $k_1 = 0.5$, $k_2 = 0.25$ and $k_3 = 0.2$. Then,
$\lambda_1 = -0.5$, $\lambda_2 = -0.25$, $\lambda_3 = -0.2$.

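The eigenvector associated with the first eigenvalue is
found by substituting $\lambda$ by $\lambda_1 = -0.5$ in Eq. 11.21b, to yield

$$[A + (0.5) I]\,v = \begin{bmatrix} 0 & 0 & 0 \\ 0.5 & 0.25 & 0 \\ 0 & 0.25 & 0.3 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \tag{11.23}$$

Solving it results in: $v_1 = [3 \;\; -6 \;\; 5]^T$ where $[\ldots]^T$ denotes the
transpose of the vector. A similar approach is followed for
the two other vectors. The general solution is finally
determined as:

$$x(t) = c_1 \begin{bmatrix} 3 \\ -6 \\ 5 \end{bmatrix} e^{-0.5t} + c_2 \begin{bmatrix} 0 \\ 1 \\ -5 \end{bmatrix} e^{-0.25t} + c_3 \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} e^{-0.2t} \tag{11.24}$$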
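
The eigenvalues and eigenvectors used in the general solution can be verified by checking that each pair satisfies A v = λ v. The short check below uses only plain Python lists; the helper names are introduced for this illustration.

```python
# State matrix of the three-room example for k1 = 0.5, k2 = 0.25, k3 = 0.2
A = [[-0.5, 0.0, 0.0],
     [0.5, -0.25, 0.0],
     [0.0, 0.25, -0.2]]

# (eigenvalue, eigenvector) pairs appearing in the general solution
eigenpairs = [
    (-0.5, [3.0, -6.0, 5.0]),
    (-0.25, [0.0, 1.0, -5.0]),
    (-0.2, [0.0, 0.0, 1.0]),
]


def mat_vec(M, v):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def eigen_residual(M, lam, v):
    """Largest absolute component of (M v - lam v); zero for a true pair."""
    Mv = mat_vec(M, v)
    return max(abs(mv_i - lam * v_i) for mv_i, v_i in zip(Mv, v))


for lam, v in eigenpairs:
    print(lam, v, "residual =", eigen_residual(A, lam, v))
```

Because the state matrix is lower triangular, its eigenvalues are simply its diagonal entries, which is why they equal the negatives of the three rate constants.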



Finally, the three constants are determined from the initial
conditions. Expanding the above equation results in:

$$\begin{aligned} x_1(t) &= 3 c_1 e^{-0.5t} \\ x_2(t) &= -6 c_1 e^{-0.5t} + c_2 e^{-0.25t} \\ x_3(t) &= 5 c_1 e^{-0.5t} - 5 c_2 e^{-0.25t} + c_3 e^{-0.2t} \end{aligned} \tag{11.25}$$

Let the initial concentration levels in the three rooms be
$x_1(0) = 900$, $x_2(0) = 0$, $x_3(0) = 0$. Inserting these in
Eq. 11.25, and solving them yields $c_1 = 300$, $c_2 = 1800$,
$c_3 = 7500$. Finally, the equations for the concentrations in the
three rooms are given by the following tri-exponential
solutions (plotted in Fig. 11.19):

$$\begin{aligned} x_1(t) &= 900 e^{-0.5t} \\ x_2(t) &= -1800 e^{-0.5t} + 1800 e^{-0.25t} \\ x_3(t) &= 1500 e^{-0.5t} - 9000 e^{-0.25t} + 7500 e^{-0.2t} \end{aligned} \tag{11.26}$$

Fig. 11.19 Variation in the concentrations over time for the
three interconnected rooms modeled as compartmental models.
(Following Eq. 11.26)
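
The tri-exponential solutions can be cross-checked by integrating the three ODEs of Eq. 11.19 numerically from the same initial conditions. The sketch below uses a classical fourth-order Runge-Kutta scheme, which is a choice made here for illustration and not a method prescribed in the text; the step size and function names are likewise assumptions.

```python
import math

K1, K2, K3 = 0.5, 0.25, 0.2   # rate constants of the three-room example


def derivs(x):
    """Right-hand side of the three-room ODEs (Eq. 11.19)."""
    x1, x2, x3 = x
    return [-K1 * x1, K1 * x1 - K2 * x2, K2 * x2 - K3 * x3]


def rk4_step(x, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    s1 = derivs(x)
    s2 = derivs([xi + 0.5 * h * si for xi, si in zip(x, s1)])
    s3 = derivs([xi + 0.5 * h * si for xi, si in zip(x, s2)])
    s4 = derivs([xi + h * si for xi, si in zip(x, s3)])
    return [xi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, s1, s2, s3, s4)]


def simulate(t_end, h=0.01):
    """Integrate from x(0) = [900, 0, 0] up to time t_end (minutes)."""
    x = [900.0, 0.0, 0.0]
    for _ in range(int(round(t_end / h))):
        x = rk4_step(x, h)
    return x


def closed_form(t):
    """Tri-exponential solutions of Eq. 11.26."""
    e1, e2, e3 = math.exp(-0.5 * t), math.exp(-0.25 * t), math.exp(-0.2 * t)
    return [900.0 * e1,
            -1800.0 * e1 + 1800.0 * e2,
            1500.0 * e1 - 9000.0 * e2 + 7500.0 * e3]


for t in (1, 5, 10, 30):
    print(t, [round(v, 3) for v in simulate(t)],
          [round(v, 3) for v in closed_form(t)])
```

Agreement between the two columns at each time confirms the constants obtained from the initial conditions.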



The application of an inverse modeling approach to this pro- 
blem can take several forms depending on the intent in deve- 
loping the model. The basic premise is that such a building 
with three inter-connected rooms exists in reality from which 
actual concentration measurements can be taken. If an actual 
test similar to that assumed above were to be carried out, 
where should the sensors be placed (in all three rooms, or 
would placing sensors in Rooms 1 and 3 suffice), and what 
should be their sensitivities and response times? One has to 
account for sensor inaccuracies or even drifts, and so would
some manner of fusing or combining all three data streams
result in more robust model identification? Can a set of
models identified under one test condition be accurate enough
to predict dynamic behavior in the three rooms under other 
contaminant releases? What should be the sampling frequen- 
cy? While longer time intervals may be adequate for routine 
measurements, the estimation would be better if high fre- 
quency samples were available immediately after a contami- 
nant release was detected. Such practical issues are surveyed 
in the next section. 



11.3.6 Practical Identifiability Issues

The complete identification problem consists of selecting 
an appropriate model, and then estimating the parameters of 
the matrices {A, B, C} in Eq. 11.17. The concepts of struc- 
tural and numerical identifiability were also introduced in 
Sects. 10.2.2 and 10.2.3 respectively. Structural identifiability,
also called "deterministic identifiability", is concerned
with arriving at performance models of the system when
perfect or noise free observations are available. If it is found
that parameters of the assumed model structure cannot be 
uniquely identified, either the model has to be reformulated 
or else additional measurements made. The latter may in- 
volve observing more or different states and/or judiciously 
perturbing the system with additional inputs. There is a vast 
body of knowledge pertinent to different disciplines as to the 
design of optimal input signals; for example, Sinha and Ku- 
szta (1983) vis-a-vis control systems, and Godfrey (1983) 
for applications involving compartmental models in general 
(while Evans 1996 limits himself to their use for indoor air 
quality modeling). 

The problem of numerical identifiability is related to the 
quality of the input-output data gathered, i.e., due to noise in 
the data and due to ill-conditioning of the correlation coef- 
ficient matrix (see Sect. 10.2.3). Godfrey (1983) discusses 
several pitfalls of compartmental modeling, one of which is 
the fact that difference in the sum of squares between different 
possible models tends to reduce as the noise in the signal in- 
creases. He also addresses the effects of limited range of data 
collected (neglecting slow transient or early termination of 
sampling or rapid transients; delayed start of sampling which 
can miss the initial spikes and limits the ability to extrapolate 
models back to the zero-time intercept values); effect of poor- 
ly spaced samples, and the effect of short samples. The three- 
room example will be used to illustrate some of these concepts 
in a somewhat ad hoc manner which the reader can emulate 
on his own, and enhance with different allied investigations. 

(a) Effect of Noise in the Data The dynamic performance 
of the three rooms is given by Eq. 1 1 .26. These models have 
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been used to generate a sequence of 30 data points at one-mi- 
nute intervals for each room. It is left to the reader to perform 
a reality check and verify that the correct values of the model 
parameters can be re-identified when no noise is introduced 
in the data. A better representation of reality is to introduce 
noise in the samples. Several amounts of noise and different 
types of distribution can be studied. Here, a sequence of nor- 
mally distributed random noise with zero bias and standard 
deviation of 20, i.e., e(0, 20) have been generated to corrupt 
the simulated data sample. This is quite a small instrument 
noise considering that the maximum concentration to be read 
is 900 ppm. This data has been used to identify the parame- 
ters assuming that the system models are known to be the 
same tri-exponential equations. One would obtain slightly 
different values depending on the random noise introduced, 
and several runs would yield a more realistic evaluation of 
the uncertainty in the parameters estimated (this is the Monte 
Carlo simulation as applied to regression analysis). Howe- 
ver, the results of only one trial are shown under column (a) 
in Table. 11.8. Note that though the model R^ is excellent and 
the dynamic prediction of the models captures the "actual" 
behavior quite well (see Fig. 1 1.20), the parameters are quite 
different from the correct values, with the differences being 
room-specific. For example, the coefficients "a and c" for 
Room 3 are poorly reproduced probably because six para- 
meters are being identified with only 30 data points. Fur- 
ther, this is a non-linear regression problem and good (i.e., 
reasonably close) starting values have to be provided to the 
statistical package. The selection of these starting values also 



Table 11.8 Results of parameter estimation for two cases of the
three-room problem with the simulation data corrupted by normally
distributed random noise ε(0, 20)

Model parameters   Correct values   (a) With sampling   (b) With high frequency
(Eq. 11.26)        (Eq. 11.26)      at one minute       sampling at 0.1 min
                                    intervals           intervals for first 10 min

Room 1   a             900              787.0               900.1
         b            -0.50            -0.439              -0.500
         Adj. R²                        96.6%               99.2%

Room 2   a           -1800            -1221.1             -1922.2
         b            -0.50            -0.707              -0.483
         c            1800             1118.8              1931.3
         d            -0.25            -0.207              -0.254
         Adj. R²                        98.0%               98.4%

Room 3   a            1500             3961.9              2706.1
         b            -0.50            -0.407              -0.436
         c           -9000           -11215.8             -8134.7
         d            -0.25            -0.280              -0.276
         e            7500             7239.0              5433.1
         f            -0.20            -0.208              -0.196
         Adj. R²      96.9% (one value only; column not identified)
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Fig. 11.20 Time series plots of measurements with random noise 
and identified models using Eq. 11.26 (case (a) of Table 11.8). a First
room, b Second room, c Third room

affects the parameter estimation, hence the need to perform
numerous tests with different starting values.
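
The Monte Carlo idea mentioned above (repeating the regression for
several noise realizations to see how much the estimates scatter) can
be sketched for the single-exponential Room 1 model y = a*exp(b*t).
This is only an illustration under stated assumptions: the sample times
(t = 0, 1, ..., 29 min), the number of trials, the seed and the function
names are hypothetical, and a simple grid search over b (with a solved
in closed form for each b) stands in for the non-linear regression
routine of a statistical package.

```python
import math
import random
import statistics

def fit_exponential(t, y, b_lo=-2.0, b_hi=-0.01, n_grid=200):
    """Least-squares fit of y = a*exp(b*t) by a grid search over b.

    For a fixed b the model is linear in a, so the best a is
    sum(y*g)/sum(g*g) with g = exp(b*t). Returns (a, b).
    """
    best = None
    for k in range(n_grid + 1):
        b = b_lo + (b_hi - b_lo) * k / n_grid
        g = [math.exp(b * ti) for ti in t]
        a = sum(yi * gi for yi, gi in zip(y, g)) / sum(gi * gi for gi in g)
        sse = sum((yi - a * gi) ** 2 for yi, gi in zip(y, g))
        if best is None or sse < best[0]:
            best = (sse, a, b)
    return best[1], best[2]

def monte_carlo(n_trials=100, sigma=20.0, seed=1):
    """Refit the model for many noise realizations; summarize the scatter."""
    rng = random.Random(seed)
    t = [float(k) for k in range(30)]  # assumed one-minute sampling
    a_hats, b_hats = [], []
    for _ in range(n_trials):
        # "True" model (a = 900, b = -0.5) corrupted by noise of std. dev. sigma
        y = [900.0 * math.exp(-0.5 * ti) + rng.gauss(0.0, sigma) for ti in t]
        a, b = fit_exponential(t, y)
        a_hats.append(a)
        b_hats.append(b)
    return {
        "a_mean": statistics.mean(a_hats), "a_std": statistics.stdev(a_hats),
        "b_mean": statistics.mean(b_hats), "b_std": statistics.stdev(b_hats),
    }

result = monte_carlo()
print(result)
```

The standard deviations returned are an empirical measure of the
parameter uncertainty; a single trial, as in column (a) of the table,
gives no such information.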

(b) Effect of Increasing Sampling Frequency Note that the
concentration in Room 1 at time t=0 predicted by the identified
model is 787.0 ppm, which is quite different from the correct value
of 900 ppm. The differences in the two other rooms are also quite
large. This is a manifestation of poorly spaced sampling. It is
intuitive to assume that parameter estimation would improve if more
samples were collected during the period when the dynamic
transients are large. This effect can be evaluated by assuming that
samples were collected at 0.1 min intervals for the first 10 min of
the 30 min data collection period. Note from Table 11.8 that the
parameter estimation for all rooms has improved greatly. The two
parameters for Room 1 are almost perfect, while those for Room 2
are very good. The exponent parameters for Room 3 (parameters b, d
and f) are quite good, while the amplitude parameters a, c and e
for Room 3 are still biased. How well the models fit the data, with
almost no patterned residual behavior, can be noted from Fig. 11.21.

Fig. 11.21 Plot to illustrate how model parameter identification is
improved if the sampling interval is reduced to 0.1 min during the
first 10 min when dynamic transients are pronounced. a First room,
b Second room, c Third room
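
Why denser sampling during the transient helps can be made plausible
with the usual linearized approximation for non-linear least squares,
cov(a, b) ≈ σ² (JᵀJ)⁻¹, where J holds the sensitivities of the model
to each parameter at the sample times. The sketch below applies it to
the Room 1 model y = a*exp(b*t). The two sampling schedules and the
function name are assumptions made for illustration; they are not the
exact schedules used to produce the table.

```python
import math

def linearized_std_errors(times, a=900.0, b=-0.5, sigma=20.0):
    """Approximate standard errors of (a, b) for y = a*exp(b*t).

    Uses cov ~ sigma^2 * inverse(J^T J), where the Jacobian columns are
    dy/da = exp(b*t) and dy/db = a*t*exp(b*t).
    """
    s_aa = s_ab = s_bb = 0.0
    for t in times:
        d_a = math.exp(b * t)
        d_b = a * t * d_a
        s_aa += d_a * d_a
        s_ab += d_a * d_b
        s_bb += d_b * d_b
    det = s_aa * s_bb - s_ab * s_ab
    # Diagonal of the inverse of the 2x2 matrix [[s_aa, s_ab], [s_ab, s_bb]]
    var_a = sigma ** 2 * s_bb / det
    var_b = sigma ** 2 * s_aa / det
    return math.sqrt(var_a), math.sqrt(var_b)

# Assumed schedule (a): one sample per minute for 30 min
coarse = [float(k) for k in range(30)]
# Assumed schedule (b): every 0.1 min for the first 10 min, then every minute
dense = [0.1 * k for k in range(101)] + [float(k) for k in range(11, 30)]

se_coarse = linearized_std_errors(coarse)
se_dense = linearized_std_errors(dense)
print("coarse: se(a) = %.2f, se(b) = %.4f" % se_coarse)
print("dense:  se(a) = %.2f, se(b) = %.4f" % se_dense)
```

Because the sensitivities exp(b*t) and a*t*exp(b*t) are close to zero
once the transient has died out, samples taken late in the record add
almost nothing to JᵀJ, which is the point made in the text.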
Fig. 11.22 Plot to illustrate patterned residual behavior when a
lower order exponential model is fit to observed data. In this case,
a two-exponential model was fit to the data generated from the third
compartment with random noise added. The adjusted R-square was 0.90.
a Time series plot, b Observed versus predicted plot



(c) Effect of Fitting Lower Order Models How would selecting a
lower order model affect the results? This aspect relates to system
identification and not to parameter estimation. The simple case of
regressing a bi-exponential model to the "observed" sample of
Room 3 is illustrated in Fig. 11.22. In this case, there is a
distinct and unmistakable pattern in the residual behavior. In
actual situations, this is a much more difficult issue. Godfrey
(1983) suggests that, in most practical instances, one should not
use more than three or four compartments. The sum-of-exponentials
models which the compartmental approach yields require non-linear
estimation which, along with the dynamic response transients and
sampling errors, makes robust system identification quite
difficult, if not impossible.



11.4 Closure 

11.4.1 Curve Fitting Versus Parameter Estimation

The terms "regression analysis" and "parameter estimation" can be
viewed as synonymous, with statisticians favoring the former term,
and scientists and engineers the latter. A common source of
confusion is the difference between parameter estimation and
"curve fitting". Curve fitting procedures are characterized by two
degrees of arbitrariness (Bard 1974):

(a) the class of functions used is arbitrary and dictated more 
by convenience (such as using a linear model) than by 
the physical nature of the process generating the data; 

(b) the best fit criterion is somewhat arbitrary and statistically
unsophisticated (such as making inferences from least squares
regression results when it is strictly invalid to do so).

Thus, curve fitting implies a black-box approach useful for
summarizing data and for interpolation. It should not be used for
extrapolation. The equations and parameters determined from curve
fitting provide little insight into the nature of the process.
However, this approach is often adequate and suitable for many
applications involving very complex phenomena, and/or when the
project budget allows for only limited monitoring and analysis
effort. Parameter estimation, on the other hand, is a more
formalized approach which includes curve fitting in its simplest
form. Grey box models, i.e., model structures derived from
theoretical considerations, are also an important sub-category of
parameter estimation and can, moreover, include previous knowledge
concerning the values of the parameters as well as the statistical
nature of the measurement errors. Some professionals use the term
"model fitting" in an attempt to distinguish it from curve fitting.
Thus, in the framework of grey box inverse models, one attempts to
identify the parameters in the model such that their physical
significance is retained, thereby allowing better understanding of
the basic physical processes involved. For example, parameters can
be associated with fluid or mechanical properties. Some of the
practical advantages of using grey box models in the framework of
the parameter estimation approach are that they provide finer
control of the process, better prediction of system performance,
and more robust on-line diagnostics and fault detection than
black-box models do. The statistical process of determining the
best parameter values in the face of unavoidable measurement errors
and imperfect experimental design is called parameter estimation.
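
The warning against extrapolating a curve fit can be illustrated with
a small sketch. The data, the decay model and all names below are
hypothetical: noise-free values of y = 900*exp(-0.5*t) are "measured"
from 0 to 5 min, then fitted once with a straight line chosen purely
for convenience (black-box) and once with an exponential form suggested
by the physics (grey-box, fitted here by linear regression on ln y).
Both are then asked to predict at t = 10 min, outside the data range.

```python
import math

def ols_line(x, y):
    """Ordinary least squares for y = c0 + c1*x; returns (c0, c1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    c1 = sxy / sxx
    return y_mean - c1 * x_mean, c1

# Hypothetical noise-free observations between 0 and 5 min
t = [0.5 * k for k in range(11)]
y = [900.0 * math.exp(-0.5 * ti) for ti in t]

# Black-box curve fit: straight line
c0, c1 = ols_line(t, y)

# Grey-box fit: y = a*exp(b*t) is linear in ln(y), so reuse the same routine
ln_a, b = ols_line(t, [math.log(yi) for yi in y])
a = math.exp(ln_a)

t_new = 10.0  # well outside the range of the data
true_value = 900.0 * math.exp(-0.5 * t_new)
line_pred = c0 + c1 * t_new
grey_pred = a * math.exp(b * t_new)
print("true: %.2f  straight line: %.2f  exponential: %.2f"
      % (true_value, line_pred, grey_pred))
```

The straight line predicts a negative concentration, while the
exponential form recovers the underlying parameters and extrapolates
sensibly. With noisy data the log transform would distort the error
structure, so this shortcut is only meant for the illustration.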



11.4.2 Non-intrusive Data Collection

As stated in Sect. 6.1, one distinguishes between experimental
design methods, which can be performed in a laboratory setting or
under controlled conditions so as to identify models and parameters
as robustly as possible, and observational or non-intrusive data
collection. The latter is associated with systems under their
normal operation and subject to whatever random and natural stimuli
perturb them. Three relevant examples of natural systems are the
periodic behavior of predator-prey populations in a closed
environment, the temporal changes in sunspot activity, and the
seasonal changes in river water flow. One can distinguish between
two types of identification techniques relevant to non-intrusive
data:
(i) off-line or batch identification, where data collection and
model identification are successive, i.e., data collection
is done first, and the model is identified later using the
entire data stream. There are different ways of processing
the data, from the simplest conventional ordinary least
squares technique to more advanced ones (some of which are
described in Chap. 10 and in this chapter);
(ii) on-line or real-time identification (also called adaptive
estimation, or recursive identification), where the model
parameters are identified in a continuous or semi-continuous
manner during the operation of the system. Recursive
identification algorithms are those where one normally
processes the data several times so as to gradually improve
the accuracy of the estimates. The basic distinction between
on-line and off-line identification is that in the former
case the new data is used to correct or update the existing
estimates without having to re-analyze the data set from the
beginning.
There are two disadvantages to on-line identification in contrast
to off-line identification. One is that the model structure needs
to be known before identification is initiated, while in the
off-line situation different types of models can be tried out. The
second disadvantage is that, with few exceptions, on-line
identification does not yield parameter estimates as accurate as
those of off-line methods (especially with relatively short data
records). However, there are also advantages in using on-line
identification. One can discard old data and retain only the model
parameters. Another advantage is that corrective action, if
necessary, can be taken in real time. On-line identification is,
thus, of critical importance in certain fields involving fast
reaction via adaptive control of critical systems, digital
telecommunication and signal processing. The interested reader can
refer to numerous textbooks on this subject (for example, Sinha and
Kuszta 1983).
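
The basic distinction drawn above, namely that on-line methods update
existing estimates with each new observation instead of re-analyzing
the whole record, can be sketched with the standard recursive least
squares (RLS) update for a linear model y = θ0 + θ1*u. RLS is used here
only as a common representative of on-line estimation; the text does
not single it out. The synthetic data, the seed, the initial covariance
and the function names are assumptions made for illustration.

```python
import random

def rls_update(theta, P, x, y):
    """One recursive least squares step for a two-parameter linear model.

    theta: current estimates, P: 2x2 scaled covariance (symmetric),
    x: regressor vector [1, u], y: new observation.
    """
    Px = [P[0][0] * x[0] + P[0][1] * x[1],
          P[1][0] * x[0] + P[1][1] * x[1]]
    denom = 1.0 + x[0] * Px[0] + x[1] * Px[1]
    gain = [Px[0] / denom, Px[1] / denom]
    err = y - (theta[0] * x[0] + theta[1] * x[1])  # prediction error
    theta = [theta[0] + gain[0] * err, theta[1] + gain[1] * err]
    # P <- P - gain * (x^T P); x^T P equals (P x)^T because P is symmetric
    P = [[P[i][j] - gain[i] * Px[j] for j in range(2)] for i in range(2)]
    return theta, P

def batch_ols(u, y):
    """Off-line estimate from the full record; returns (intercept, slope)."""
    n = len(u)
    u_mean = sum(u) / n
    y_mean = sum(y) / n
    slope = (sum((ui - u_mean) * (yi - y_mean) for ui, yi in zip(u, y))
             / sum((ui - u_mean) ** 2 for ui in u))
    return y_mean - slope * u_mean, slope

# Hypothetical data stream: y = 2 + 0.5*u plus small measurement noise
rng = random.Random(0)
u = [rng.uniform(0.0, 10.0) for _ in range(200)]
y = [2.0 + 0.5 * ui + rng.gauss(0.0, 0.1) for ui in u]

theta = [0.0, 0.0]                 # no prior knowledge of the parameters
P = [[1e6, 0.0], [0.0, 1e6]]       # large P expresses that ignorance
for ui, yi in zip(u, y):
    theta, P = rls_update(theta, P, [1.0, ui], yi)  # old data never revisited

b0, b1 = batch_ols(u, y)
print("RLS:   %.4f %.4f" % (theta[0], theta[1]))
print("batch: %.4f %.4f" % (b0, b1))
```

Only theta and P are carried from one step to the next, which is why
old data can be discarded. For this simple linear case the recursive
and batch estimates nearly coincide once the whole record has been
processed.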



11.4.3 Data Fusion and Functional Testing

The techniques and concepts presented in this chapter are several
decades old and, though quite basic, are still relevant since they
provide the foundation for more advanced and recent methods.
Performance of engineered systems was traditionally measured by
sampling, averaging and recording analog or digital data from
sensors in the form of numeric streams of data. Nowadays, sensory
data streams from multiple and disparate sources (such as video,
radar, sonar, vibration, ...) are easy and cheap to obtain. It is
no surprise, then, that they are used to supplement data from
traditional monitoring systems, resulting in data that is more
comprehensive and that allows more robust decisions to be made.
Such multi-sensor and mixed-mode data streams have led to a
discipline called data fusion, defined as the use of techniques
that combine data from multiple sources in order to achieve
inferences which will be more robust than if they were achieved by
means of a single source. A good review of data fusion models and
architectures is provided by Esteban et al. (2005). Application
areas such as space, robotics, medicine and sensor networks have
seen a plethora of allied data fusion methods aimed at detection,
recognition, identification, tracking, change detection, decision
making, etc. These techniques are generally studied under signal
processing, sensor networks, data mining, and engineering decision
making.

With engineered systems becoming increasingly complex, several
innovative means have been developed to check the sensing hardware
itself. One such technique is functional testing, a term often
associated with software testing, where the purpose is to ensure
that the program works the way it was intended while conforming to
the relevant industry standards. It is also being used in the
context of testing the proper functioning of engineered systems
which are controlled by embedded sensors. For example, consider a
piece of control hardware whose purpose is to change the operating
conditions of a chemical digester (say, the temperature or the
concentration of the mix). Whether or not the control hardware is
operating properly can be evaluated by intentionally sending test
signals to it at some pre-determined frequency and duration, and
then analyzing the corresponding digester output to determine
whether performance is satisfactory. If it is not, the operator is
alerted, and corrective action to recalibrate the control hardware
may be warranted. Such approaches are widely used in industrial
systems, and have also been proposed and demonstrated for building
energy system controls.



Problems


Pr. 11.1 Consider the data in Pr. 10.14 where the Wind Chill (WC)
factor is tabulated for different values of ambient temperature and
wind velocity. Use this data to evaluate different multi-layer
perceptron (MLP) architectures assuming one hidden layer only. Use
50% of the data for training, 25% for validation and 25% for
testing. Compare the model goodness of fit and residual behavior of
the best identified MLP with those of the regression models
identified in Pr. 10.14.

Pr. 11.2 Consider Fig. 1.5c showing the nth order thermal network
model for heat conduction through a wall. The internal node
temperatures T_i are the state variables whose values are internal
to the system, and have to be expressed in terms of the outdoor
temperature (variable u), the indoor air temperature (variable y)
and the individual resistors and capacitors of the system
(parameters of the system). Assume the following two networks,
write the model equations for the temperature T_i at each internal
node, reframe them in the state variable formulation (Eq. 11.17)
and identify the elements of the A and B matrices in terms of the
resistances and the capacitances:

(a) network with 2 capacitors and 3 resistors (3R2C)

(b) network with 3 capacitors and 4 resistors (4R3C).

Pr. 11.3 Repeat the analysis of the three-room compartmental
problem presented in Sect. 11.3.5 for the configuration shown in
Fig. 11.23.

(a) Follow the same procedure to derive the exponential equations
for the dynamic response of each of the three rooms assuming
the same initial conditions, namely:
x_1(0) = 900, x_2(0) = 0, x_3(0) = 0

(b) Use these models to generate synthetic data with random noise
and perform an analysis similar to that shown in Table 11.8.
Discuss your results.

Fig. 11.23 A building with interconnected and leaky rooms
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Risk Analysis and Decision-Making 
As stated in Chap. 1, inverse modeling is not an end in itself
but a precursor to model building needed either for better
understanding of the process or for decision-making that results in
some type of action. Decision theory is the study of methods for
arriving at rational decisions under uncertainty. This chapter
provides an overview of quantitative decision-making methods under
uncertainty along with a description of the various phases
involved. The decisions themselves may or may not prove to be
correct in the long term, but the scientific community has reached
a general consensus on the methodology to address such problems.
Different sources of uncertainty are described, and an overall
framework along with various elements of decision-making under
uncertainty is presented. How Bayes' approach can play an important
role in decision-making is also briefly presented. Decision
problems are clouded by uncertainty and/or by having to meet
multiple objectives; this chapter also presents basic concepts of
multi-attribute modeling. The role of risk analysis and its various
aspects are covered along with applications from various fields.
General principles are presented in somewhat idealized situations
for better understanding, while several illustrative case studies
are meant to provide exposure to real-world situations. The topics
which relate to decision analysis are numerous and at varying
levels of maturity; only a basic overview of this vast body of
knowledge is provided in this chapter.



12.1 Background 

12.1.1 Types of Decision-Making Problems and Applications

Decision-making is pervasive in real life; one makes numerous
mundane decisions daily, from what set of clothes to wear in the
morning to what type of cuisine to eat for lunch. However, what is
of interest here is the formal manner in which one goes about
tackling more complex engineering and scientific problems involving
several alternatives with bigger payoffs and penalties. It is
assumed that the reader is familiar with basic notions taught in
undergraduate classes on economic analysis of alternatives
involving time value of money, simple payback analysis as well as
present worth cash flow approaches. As stated in Sect. 1.6,
decision theory is the study of methods for arriving at "rational"
decisions under uncertainty. This theory, albeit in slightly
different forms, applies to a broad range of applications covering
engineering, health, manufacturing, business, finance, and
government.

One can distinguish between two types of problems depending on the
types of variables under study:
(i) The continuous case, perhaps the simpler application, such as
controlling an existing engineered system or operating it in
a certain manner so as to minimize a cost or utility function
subject to constraints. Such problems involve optimization of
one (or several) continuous variables.
(ii) The discrete case, such as evaluating different alternatives
in order to select a course of action. This deals with a
plethora of problems faced in numerous disciplines. In
engineering, the explicit consideration of numerous choices
in terms of design (from the selection of materials of
components to how the components are assembled into systems)
is a simple example. One could also consider decision
analysis involving after-the-fact intervention, where data
are collected from the system while in operation, subsequent
analysis reveals that the design erred in some of its
assumptions and, hence, discrete changes have to be made to
the system or its operation. In environmental science, for a
problem such as high pollution in a certain area, one can
visualize reaching decisions involving both engineering
mitigation measures and policy decisions to remedy the
situation and prevent future occurrences.
The decisions themselves may or may not prove to be correct in the
long term, but the process provides a structure for the overall
methodology. It is in situations where uncertainties dominate that
this framework assumes crucial importance. Certain aspects of
uncertainty have previously been discussed in this book. In
Sect. 1.5, four different types of data uncertainty were described
which relate to data types classified as qualitative, ordinal,
metric and count (see Sect. 1.2.1). Propagation of measurement
errors associated with metric data was addressed in Sects. 3.6 and
3.7. Uncertainties in regression model parameters when using least
squares estimation and the resulting model prediction uncertainties
were presented in Sects. 5.3.2, 5.3.4 and 5.5.2. Here, one deals
with uncertainty in a larger system context, involving all the
above aspects in some form or another, as well as other issues such
as the various chance events and outcomes. A common categorization
of uncertainty at the systems level involves:

(a) Structural deficiency, which, for the continuous variable
problem, is due to an incomplete or improper formulation of
the objective function model (such as use of improper
surrogate variables, excluded influential variables, improper
functional form) or of the constraints. For the discrete case,
this could be due to overlooking some of the possible outcomes
or chance nodes. This is akin to improper model or system
identification.

(b) Uncertainty in the parameters of the model, which manifests
itself by biased model predictions; this is akin to improper
parameter estimation.

(c) Stochasticity or inherent ambiguity which no amount of
additional measurements can reduce. This stochasticity can
appear in the model parameters, or in the input forcing
functions or chance events of the system, or in the behavior
of the system outcomes, or in the vagueness associated with
the risk attitudes of the user (one could adopt either fuzzy
logic or follow the more traditional probabilistic approach).
Thus, stochasticity can be viewed as a type of uncertainty
(note that some professionals adopt the view that it is
distinct from uncertainty and treat it as such).

The first two sources above can be grouped under epistemic
uncertainty, i.e., ignorance or lack of complete knowledge. This
can be reduced as more metric data is acquired: the model structure
is improved and the parameters are identified more accurately. The
third source is referred to as aleatory uncertainty, and there are
well accepted ways of analyzing such situations. The above
considerations provide one way of classifying different
decision-making problems (Fig. 12.1):
(a) Problems involving no or low aleatory and epistemic
uncertainty: In such cases, the need for careful analysis
stems from the fact that the problem to be solved is
mathematically complex with various possible options, but the
model structure itself is well-defined, i.e., robust, accurate
and complete. Uncertainty in the specification of the model
parameters and/or inaccurate measurements of model inputs has
a relatively minor impact on the model results. This is
treated under mathematical optimization in traditional
engineering and operations research disciplines, and most
undergraduates are exposed to such techniques. Such problems
fall under decision-making under certainty, and an overview
has already been provided in Sect. 7.2. An illustrative
example of how decision-making may play an important role even
in such situations is given in Sect. 12.4.2.

(b) Problems involving low epistemic but high aleatory
uncertainty: Queuing models studied in operations research
fall in this category. Here, the arrival of customers who join
queues for servicing is modeled as a random variable, while
the time taken for servicing a customer may also be modeled as
another random variable. Determining the number of service
counters to keep open so that the mean waiting time for
customers is less than a stipulated value is a typical example
of such problems. This instance is discussed in Sect. 12.2.7.

(c) Problems involving high epistemic but low aleatory
uncertainty: Cases falling in this category arise when the
model structure is empirical or heuristic (such as logistic
functions meant to model dose-response curves discussed in
Sect. 10.4.4), and the outcomes themselves are not complete
and/or not well-defined. The occurrence of the events or
inputs to the system may be probabilistic, but these are
considered known in advance in statistical terms.

(d) Problems involving both high epistemic and high aleatory
uncertainties: Such cases are the most difficult to deal with,
and different analysts can reach widely different conclusions.
Examples include societal costs due to air pollution from
fossil-based power plants or the societal cost of a major
earthquake hitting California. The treatment of such cases is
very challenging and complex, and can be found in research
papers (for example, Rabl 2005). A case study example is
provided in Sect. 12.4.1.

(e) Problems involving competitors with similar objectives: Such
cases fall under gaming problems or decision-making under
situations where the multiple entities have potentially
conflicting interests. There are two types of games: under
conflict, i.e., the various competitors are trying to win the
"game" at the expense of the others based on their selfish
needs (called a non-cooperative situation, to which the Nash
equilibrium applies), or trying to reach a mutually acceptable
solution (called a cooperative situation) which maximizes the
benefit to the whole group. There is a rich body of knowledge
in this area, which has great implications in international
trade, business and warfare, but much less so in engineering
and science applications; hence, these are not treated in this
book.

Fig. 12.1 Overview of different types of problems addressed by
decision analysis and treated in this chapter along with section
numbers. Traditional optimization corresponds to decision-making
under low epistemic and aleatory uncertainties while the other cases
can be viewed as decision-making under different forms of uncertainty

All the above approaches (except the first) involve explicit
consideration of the uncertainty during the process of reaching a
solution, as against performing an uncertainty analysis after a
solution or a decision has been made (as adopted during an
engineering sensitivity analysis). These approaches require
adopting formal risk analysis techniques whereby all uncertain and
undesirable events are framed as risks, their probabilities of
occurrence selected, the chain of events simplified and modeled,
the gains and losses associated with different options computed,
trade-offs between competing alternatives assessed, and the risk
attitude of the decision-maker captured (Clemen and Reilly 2001).
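
The queuing instance mentioned under case (b), i.e., finding the number
of service counters needed to keep the mean waiting time below a
stipulated value, can be sketched with the classical M/M/c queue and
the Erlang C formula. The choice of this particular model (Poisson
arrivals, exponentially distributed service times, identical counters)
is an assumption made here for illustration, as are the numerical rates
and the function names.

```python
import math

def mean_wait_mmc(lam, mu, c):
    """Mean time spent waiting in queue (excluding service), M/M/c system.

    lam: arrival rate, mu: service rate per counter, c: number of counters.
    """
    a = lam / mu  # offered load
    if a >= c:
        return math.inf  # the queue grows without bound
    tail = (a ** c / math.factorial(c)) * (c / (c - a))
    head = sum(a ** k / math.factorial(k) for k in range(c))
    p_wait = tail / (head + tail)  # Erlang C: probability of having to wait
    return p_wait / (c * mu - lam)

def min_counters(lam, mu, max_wait):
    """Smallest number of counters whose mean queue wait is <= max_wait."""
    c = 1
    while mean_wait_mmc(lam, mu, c) > max_wait:
        c += 1
    return c

# Hypothetical case: 4 customers/min arrive, each counter serves 1 customer/min,
# and the mean wait in queue must not exceed 0.5 min.
print(min_counters(4.0, 1.0, 0.5))
```

Here the aleatory part (random arrivals and service times) is handled
analytically because the model structure is taken as known; when it is
not, simulation is the usual alternative.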

Given their importance in situations with high epistemic
uncertainty, risk analysis methods are discussed in Sect. 12.3. A
simple way of differentiating between risk analysis and
decision-making is to view the former as event-focused (what can go
wrong? how likely is it? what are the consequences?), and mostly
quantitative and empirical in nature, while decision-making is more
comprehensive and includes, in addition, issues that are more
qualitative and normative such as uncertain knowledge and uncertain
future (Haimes 2004). Assuming different risk models can lead to
different decisions. It is clear that risk analysis is a subset,
albeit an important one, of the decision-making process. It is
important to keep in mind that the real issue is not whether the
uncertainty is large or not, but whether it may cause a decision to
be made that is improper or different from the best one.



12.1.2 Engineering Decisions Involving Discrete Alternatives

This sub-section presents one example of an engineering decision
involving the selection of a piece of equipment among competing
possibilities. This is a simple context-setting review example of
optimization problems with low epistemic uncertainty.

Table 12.1 Problem specification for Example 12.1.1

                               Pump A    Pump B
Purchase price ($)             4,000     6,000
Efficiency                     80%       92%
Annual maintenance ($/year)    200       450
Life (years)                   4         6

Example 12.1.1: Simple example of an engineering decision
involving equipment selection

A designer is faced with the problem of selecting the better of two
pumps of 2 kW each meant to operate 6,000 h/year with the
attributes assembled in Table 12.1.

If the unit price of electricity is $0.10/kWh, which pump is the
better selection if the time value of money is not considered? Note
that no uncertainty as such is stated in the problem.

Annual operating expense

for Pump A = (2 kW/0.80) x (6,000 h/year) x ($0.10/kWh) + $200
           = $1,700/year

for Pump B = (2 kW/0.92) x (6,000 h/year) x ($0.10/kWh) + $450
           ≈ $1,750/year.

Since both pumps have the same purchase price per year of life
($1,000/year), one would select the less efficient Pump A since it
has a slightly lower annual operating cost. ■
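
The arithmetic of the example can be checked with a few lines of code.
The function and variable names below are illustrative only; the
numbers are those given in the problem statement.

```python
def annual_operating_cost(power_kw, efficiency, hours, price_per_kwh, maintenance):
    """Electricity cost of delivering power_kw at the given efficiency, plus maintenance."""
    return (power_kw / efficiency) * hours * price_per_kwh + maintenance

cost_a = annual_operating_cost(2.0, 0.80, 6000.0, 0.10, 200.0)
cost_b = annual_operating_cost(2.0, 0.92, 6000.0, 0.10, 450.0)

# Purchase price spread evenly over the life, ignoring time value of money
capital_a = 4000.0 / 4
capital_b = 6000.0 / 6

print("Pump A: %.0f $/year operating, %.0f $/year capital" % (cost_a, capital_a))
print("Pump B: %.0f $/year operating, %.0f $/year capital" % (cost_b, capital_b))
print("Preferred:", "Pump A" if cost_a + capital_a < cost_b + capital_b else "Pump B")
```

The unrounded operating cost of Pump B is about $1,754/year, which the
example rounds to $1,750/year; the ranking of the two pumps is not
affected.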

The two options are sufficiently close that a sensitivity analysis
is warranted. The traditional process adopted in engineering
analysis is to identify the important parameters (in this problem,
there are no "variables" as such) either by a formal sensitivity
analysis or by physical intuition, assign a range of variability to
the base values (say, the uncertainty expressed as a standard
deviation), and look at the resulting changes in the final
operating costs. For example, in the simple case treated above,
suppose the designer comes to know from another professional that
the efficiency of Pump A is not 80% as claimed but closer to 75%.
If this value were assumed, the annual operating expense for Pump A
would turn out to be $1,800/year, and the designer would lean
towards selecting Pump B instead. Obviously, the other quantities
assumed would also have some uncertainty, and evaluating the
alternatives under the combined influence of all these
uncertainties is a more realistic and complex problem than the
simplistic one assumed above. If the uncertainties around the
parameters are relatively small (say, a coefficient of variation
CV, i.e., the uncertainty normalized by the mean value, of around
5-10%), the problem would be treated as a low aleatory problem.
Techniques relevant to this case are those studied under
traditional optimization (discussed in Sect. 7.2). However, if the
uncertainties are large (say, CV values of 20-50%), such as in many
health and environmental issues, the problem would be treated as a
high aleatory problem requiring a more formal probabilistic
approach involving, say, Monte Carlo methods (this is treated in
Sect. 12.2.7).
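
A minimal Monte Carlo sketch of the pump comparison is given below. It
is not the treatment of Sect. 12.2.7; it simply assumes, for
illustration, that only the two efficiencies are uncertain, that they
are independent and normally distributed about their nominal values
with a common coefficient of variation, and that samples are clipped to
a physically meaningful range. The function names, sample size and seed
are likewise hypothetical.

```python
import random

def annual_cost(efficiency, maintenance, power_kw=2.0, hours=6000.0, price=0.10):
    """Annual operating cost in $/year for one pump."""
    return (power_kw / efficiency) * hours * price + maintenance

def prob_a_cheaper(cv=0.05, n=20000, seed=42):
    """Fraction of random draws in which Pump A has the lower operating cost."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        # Efficiencies drawn about the nominal 80% and 92%, clipped to (0.05, 1.0]
        eff_a = min(1.0, max(0.05, rng.gauss(0.80, cv * 0.80)))
        eff_b = min(1.0, max(0.05, rng.gauss(0.92, cv * 0.92)))
        if annual_cost(eff_a, 200.0) < annual_cost(eff_b, 450.0):
            wins += 1
    return wins / n

for cv in (0.0, 0.05, 0.10, 0.30):
    print("CV = %.2f  P(A cheaper) = %.3f" % (cv, prob_a_cheaper(cv)))
```

Since both pumps have the same purchase price per year of life, only
operating costs are compared. The output is a probability that Pump A
is the cheaper choice and not a single yes/no answer, which is the
essential change in viewpoint when uncertainty is treated explicitly.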

Formal analysis methods of the more complex classes (c) and (d)
involving both aleatory and epistemic uncertainty are illustrated
by way of examples in Sects. 12.2.2, 12.2.7 and 12.4.1. The simple
pump selection example can be used to illustrate the type of
considerations which would require such an approach. For example,
the uncertainties of the parameters assumed in case (b) above are
taken to be due to the randomness of the variables and not due to a
lack of knowledge which additional equipment or system performance
measurements can reduce. Thus, the issue of structural uncertainty
did not arise. Suppose instead that the problem is framed as one
where the amount of water to be pumped is not a fixed demand
requiring the pumps to consume their rated 2 kW at all times, but a
variable quantity which depends on, say, the outdoor temperature
(the hotter it is, the more water is required, for example, to
irrigate an agricultural field). One now has two models: one for
predicting the water demand against temperature, and another for
the power consumed by the pumps under variable flow (since the
efficiency is not constant under part-load operation). The
regression models are unlikely to be perfect (i.e., have R² of
100%) due to overlooking other variables which also dictate water
demand, and may be biased due to inadequate data point coverage of
the entire performance map. If the models are excellent (say,
R² = 95%), then the structural uncertainty is unlikely to be
important, and whatever small structural uncertainties may exist
are lumped with the random or aleatory uncertainties and treated
together. However, if the models are poor, then the structural
deficiencies would influence the analysis in a fundamentally
different manner from the pure aleatory problem, and a formal
approach for case (c) would be warranted.



12.2 Decision-Making Under Uncertainty

12.2.1 General Framework 

Unlike decision-making under certainty (discussed in Sect. 7.2
under traditional optimization), decision-making under uncertainty
(or under risk) applies to situations when a course of action is to
be selected when the results of each alternative course of action
or outcomes of different possible chance events are not fully known
deterministically. In such cases, the process of decision-making
with discrete alternatives can be framed as a series of steps
described below (Clemen and Reilly 2001):



(a) Frame the decision problem and pertinent objectives in 

a qualitative manner 

This is often not as obvious as it seems. One may unin- 
tentionally be framing the "wrong problem" perhaps 
because the objectives were not clear at the onset. For 
example, the industrialist planning on building a photo- 
voltaic (PV) module assembly factory at a particular 
location naturally wishes to maximize profit. However, 
he also considers a second attribute, namely creation of 
as many new jobs as possible in the factory. However, it 
could be that the real intent was to create prosperity in 
the region and not by direct hires. It is very likely that 
the subsequent decisions would be quite different. Hen- 
ce, articulating and then framing the objective variable 
or function so that the real intent is captured is a critical 
first step. This phase, often referred to as "getting the 
context right", may require several iterations in cases 
when the decision-maker is unclear as to his innermost objectives 
or modifies them during the course of this step. 

(b) Identify decision alternatives or actions 

This step would involve carefully identifying the vari- 
ous choices or alternatives or actions and characterizing 
them. For example, in the PV module factory instance, 
one can view the decision alternatives as technical and 
economic. The former would include the selection of 
the type of PV cell technology (such as say, single crys- 
talline silicon or thin film), and the module size or pow- 
er rating and voltage output, . . . which would maximize 
sales based on certain types of anticipated applications. 
Actions involving economic factors could involve selec- 
ting the capacity of the production line, and/or possible 
location (say, whether Malaysia or Vietnam or China). 

(c) Identify and quantify chance events 

This step involves identifying unexpected factors which 
could affect the alternatives selected in step (b): eco- 
nomy turning sour, a breakthrough in a new type of 
PV cell, government withdrawing/reducing rebates to 
solar installations, etc. The chance events are probabilistic 
and have to be framed as either discrete or continuous. 
Continuous probability distributions (either objective or 
subjective) are often approximated by discrete ones for 
simpler treatment. This is discussed in Sect. 12.2.3. The 
chance events must be collectively exhaustive (i.e., in- 
clude all possible situations), and be mutually exclusive 
(i.e., only one of the outcomes can happen). The term 
"states of nature" is also widely used synonymously to 
chance events. 

(d) Assemble the entire decision problem 

The interactions of the decomposed smaller (and, thus, 
more manageable) pieces of the problem are represen- 
ted by influence diagrams and decision trees which pro- 
vide a clear overall depiction of the structure of the ent- 
ire problem. This aspect is discussed in Sect. 12.2.2. 
(e) Develop mathematical representations or models 
Here the outcomes (or consequences or payoffs) of each 
chance event and action are considered, and a structure 
to the problem is provided by framing the correspon- 
ding mathematical models along with constraints. 

(f) Identify sources and characterize magnitude of uncer- 
tainties 

The issues relevant to this aspect can arise from steps (c)- 
(e) and have been previously described in Sect. 12.1.1. 

(g) Model user preferences and risk attitudes 

Unlike traditional optimization problems, decisions 
are often not taken based simply on maximizing an 
objective function. Except for the simple types of pro- 
blems, alternative or competing decisions have different 
amounts of risk. Different people have different risk at- 
titudes, and thus are willing to accept different levels 
of risk. The same person may also have a different 
risk attitude under different situations. Thus, the utili- 
ty function represents a way of mapping different units 
(for example, operating cost in dollars) into a "utility 
value" number more meaningful to the individual. How 
to model utility functions is discussed in Sect. 12.2.5. 
Further, as discussed earlier in step (a), there may be 
more than one objective on which the decision is to be 
made. How to consider multi-attributes and develop an 
objective function for the whole problem is an integral 
part of this step, and is also described in Sect. 12.2.6. 
Clearly, this step is an important one requiring a lot of 
skill, effort and commitment. 

(h) Assemble the objective function and solve the complete 
model along with uncertainty bounds 
The solution or the outcome of the entire model under 
low uncertainty would involve the types of continuous 
optimization techniques described in Sect. 7.2, as well 
as discrete ones illustrated through Example 12.1.1. 
However, when uncertainty plays a major role in the 
problem definition, it will have a dominant effect on 
the solution as well. One can no longer separate the 
solution of the model (objective function plus risk at- 
titude) from its variability. Sensitivity or post-optimal 
analysis, so useful for low uncertainty problems, is to 
be supplemented by generating an uncertainty distribu- 
tion about optimal solution. The latter can be achieved 
by Monte Carlo methods (described and illustrated in 
Sects. 12.2.7 and 12.4.2). 

(i) Perform post-optimal analysis and select action to im- 
plement 

The last step may involve certain types of post-optimal 
analysis before selecting a course of action. Such 
analysis could involve reappraising certain factors such 
as chance events and their probability of occurrence, variants 
to risk attitudes, inclusion of more alternatives, etc. 
The concept of indifference is often used (illustrated in 
Sect. 12.4.2). Hence, this last step would involve reiterations 
till a preferred and satisfactory alternative or 
course of action is decided upon. 



12.2.2 Modeling Problem Structure Using Influence Diagrams and Decision Trees 

This section describes issues pertinent to step (d) above in- 
volving structuring the problem which is done once the ob- 
jectives, decision alternatives and chance events have been 
identified. Two types of representations are used: influence 
diagrams and decision trees. An influence diagram is a simp- 
le graphical representation of a decision analysis problem 
which captures the decision-maker's current state of know- 
ledge in a compact and effective manner. For example, the 
commercialization process of a new product launch invol- 
ving essential elements such as decisions, uncertainties, and 
objectives as well as their influence on each other is captured 
by the influence diagram shown in Fig. 12.2. Different nodes 
of different shapes correspond to different types of variab- 
les; the convention is to represent decision nodes as squares, 
chance nodes as circles and payoff nodes as diamonds (or as 
triangles). Arrows (or arcs) define their relationship to each 
other. The convention when creating influence diagrams is to 
follow certain guidelines: (i) use only one payoff node, (ii) 
do not use cycles or loops, and (iii) avoid using barren nodes 
which do not lead to some other node (except when such a 
node makes the representation much more understandable). 
Influence diagrams are especially useful for communica- 
ting a decision model to others and creating an overview of 
a complex decision problem. However, a drawback to influ- 
ence diagrams is that their abstraction hides many details. It 
is difficult to see what possible outcomes are associated with 
an event or decision as many outcomes can be embedded in 
a single influence diagram decision or chance node. Often, 







Fig. 12.2 Influence diagram for the process of commercialization of 
a product with the objective of achieving a sizeable market share. By 
convention, decision nodes are represented by squares, chance nodes 
by circles and payoff nodes by diamonds (or as triangles), while arrows 
(or arcs) define their relationship to each other 
it is also not possible to infer the chronological sequence of 
events appearing during the decision-making process. 

Decision trees (also called decision flow networks or 
decision diagrams), as opposed to influence diagrams, in- 
clude all possible decision options and chance events with 
a branching structure. They proceed chronologically, left to 
right, showing events and decisions as they occur in time. 
All options, outcomes and payoffs, along with the values and 
probabilities associated with them, are depicted directly with 
little ambiguity as to the possible outcomes. Decision trees 
are popular because they provide a visualization of the pro- 
blem which integrates both graphical and analytical aspects 
of the problem. In that sense, probability trees, already intro- 
duced and illustrated in Sect. 2.2, are closely related to deci- 
sion trees. Decision trees do not have arcs. Instead, they use 
branches, which extend from each node. Branches are used 
as follows for the three main node types in a decision tree: 
(i) a decision node has a branch extending from it for every 
available option, (ii) a chance node has a branch for each 
possible outcome, and (iii) an end node has no branches suc- 
ceeding it, and returns the payoff and probability for the as- 
sociated path. Figure 12.3 is the corresponding decision tree 
for the product launch example whose influence diagram is 
shown in Fig. 12.2. 

In summary, the influence diagram and decision tree 
show different kinds of information. Influence diagrams are 
excellent for displaying a decision's overall structure, but 



they hide many details. They show the dependencies among 
the variables more clearly than the decision tree. Influence 
diagrams offer an intuitive way to identify and display the 
essential elements, including decisions, uncertainties, and 
objectives, and how they influence each other. The diagram 
provides a high-level qualitative view under which the ana- 
lyst builds a detailed quantitative model. This simplicity is 
useful since it shows the decisions and events in one's model 
using a small number of nodes. This makes the diagram very 
accessible, helping others to understand the key aspects of 
the decision problem without getting bogged down in de- 
tails of every possible branch as shown in a decision tree. 
The decision tree, on the other hand, shows more details of 
all possible paths or scenarios as sequences of branches. It 
should be used at a later stage of the analysis. 
For simpler problems, the use of influence diagrams may be 
unnecessary, and one can resort to the use of decision trees 
directly for analysis. 

Example 12.2.2: Example involving two risky alternatives 
each with probabilistic outcomes 

This example illustrates the approach for a one-stage proba- 
bilistic problem with two risky alternatives. An oil company 
is concerned that the production from one of its oil wells 
has been declining and deems that the oil revenue can be 
increased if certain technological upgrades in the extraction 
process are implemented. There are two different technolo- 



Fig. 12.3 Decision tree diagram corresponding to the influence 
diagram shown in Fig. 12.2. (Figure not reproduced. Node labels 
recoverable from the figure: Fund R&D, R&D Success, Launch product 
and Market Success; the market-share outcomes Good, Moderate and Poor 
lead to Cases 1, 2 and 3, and the No and Fail branches end in Stop.) 




Table 12.2 Relevant data needed to analyze the two risky alternatives 
for the oil company (Example 12.2.2) 

Alternative   Capital investment      State of   Probability of   Annual revenue over the base case 
              (millions of dollars)   economy    occurrence       of doing nothing (millions/year) 
1             $320                    Good       0.25             $120 
                                      Average    0.60             $80 
                                      Poor       0.15             $40 
2             $100                    Good       0.25             $50 
                                      Average    0.60             $35 
                                      Poor       0.15             $20 



gies that can be implemented with different initial costs and 
different impacts on the oil revenue per year over a time ho- 
rizon of 6 years. However, the oil revenues could change de- 
pending on whether the oil demand is good, average or poor 
(factors that depend on the overall economy and not on the 
oil company). Table 12.2 summarizes pertinent data for the 
three possibilities and the two competing alternatives. This 
example has only two factors: the technical upgrade alter- 
natives, and the state of the economy (which in turn impacts 
oil demand). The influence diagram for this scenario is easy 
to generate and is depicted in Fig. 12.4. 

Uncertainty regarding the state of the economy can be 
modeled by subjective probabilities, i.e., by polling expert 
economists and analysts. Probability in this case is to be 
viewed as a "likelihood" of occurrence rather than as the 
classical frequentist interpretation of long-run fraction. Eco- 
nomists of the company followed such a process and deter- 
mined that the state of the economy over the next 6 years can 
be represented as chance events with probabilities of: good 
p(G) = 0.25, average p(A) = 0.60, and poor p(P) = 0.15. 

This is a single stage decision problem whose decision 
tree diagram is shown in Fig. 12.5. Such diagrams provide 
a convenient manner of visually depicting the problem, and 
also performing the associated probability propagation cal- 
culations under the classical frequentist view. If one neglects 
the time value of money, the total net savings under each of 
the three economic scenarios and for both alternatives are 
shown in the third column of Table 12.3. The expected value 
EV (some books use the term "expected worth") for each 
Fig. 12.5 Decision tree diagram for the oil company example. Note 
that this problem has three alternative action paths, three chance events 
(good, average, poor) and seven outcomes. The objective variable is the 
expected value EV shown in the last column of Table 12.3 



Table 12.3 Calculation procedure of the expected value of both alternatives 
for the oil company 

Alternative   Chance event       Total net savings (TNS)     Expected value 
              probability p(x)   over 6 years (millions $)   EV(x) = TNS * p(x) 
1             0.25               -320 + (6)(120) = 400       100 
              0.60               -320 + (6)(80) = 160        96 
              0.15               -320 + (6)(40) = -80        -12 
                                 Total                       184 
2             0.25               -100 + (6)(50) = 200        50 
              0.60               -100 + (6)(35) = 110        66 
              0.15               -100 + (6)(20) = 20         3 
                                 Total                       119 



Fig. 12.4 Influence diagram for Example 12.2.2 



outcome is shown in the last column by multiplying these 
values by the corresponding probability values. The total EV 
is simply the sum of these three values. Note that the annu- 
al savings are incremental values, i.e., over and above the 
current extraction system. Since the sums are positive, the 
analysis indicates that both alternatives are preferable to the 
current state of affairs, but that Alternative 1 is the better one 
to adopt since its EV is higher. 
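
The arithmetic of Tables 12.2 and 12.3 can be reproduced with a short script. The following is only a minimal sketch: the variable and function names are illustrative choices, not part of the original example, and the time value of money is neglected exactly as in the text.

```python
# Data taken from Table 12.2 (all monetary values in millions of dollars).
# The dictionary layout and names are illustrative assumptions.
PROBABILITIES = {"good": 0.25, "average": 0.60, "poor": 0.15}
ALTERNATIVES = {
    "Alt 1": {"investment": 320, "revenue": {"good": 120, "average": 80, "poor": 40}},
    "Alt 2": {"investment": 100, "revenue": {"good": 50, "average": 35, "poor": 20}},
}
YEARS = 6  # time horizon of the analysis


def total_net_savings(alt, state, years=YEARS):
    """Undiscounted net savings over the horizon for one state of the economy."""
    return -alt["investment"] + years * alt["revenue"][state]


def expected_value(alt, probabilities=PROBABILITIES):
    """Probability-weighted sum of the net savings over all chance events."""
    return sum(p * total_net_savings(alt, state) for state, p in probabilities.items())


ev = {name: expected_value(alt) for name, alt in ALTERNATIVES.items()}
best = max(ev, key=ev.get)  # alternative with the highest expected value

for name, value in ev.items():
    print(f"{name}: EV = {value:.0f} million dollars")
print("Preferred on EV alone:", best)
```

Changing the entries of PROBABILITIES is a quick way of checking how sensitive the ranking of the two alternatives is to the subjective probability estimates.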

Though probabilities are involved, the analysis assumed 
them to be known without any inherent uncertainty. Hence, 
this is still a case of low epistemic and low aleatory uncertainties. 
A more realistic reformulation of this problem 
would include: (i) treating variability in the probability dis- 
tributions (case "b" of our earlier categorization shown in 
Fig. 12.1), and (ii) including the inherent uncertainties in the 
models used to predict the annual revenues for each of the 
three economy states (case "d"). ■ 
12.2.3 Modeling Chance Events 

(a) Discretizing probability distributions 

Probability of occurrence of chance and uncertain events, 
and even subjective opinions, are often characterized by 
continuous probability distribution functions (PDF). Exam- 
ple 12.2.2 assumed such an approach to modeling the state 
of the economy with three distinct options. Replacing the 
continuous PDF by a small set of discretized events, especi- 
ally for complex problems, simplifies the conceptualization 
of the decision, drawing the tree events as well as the calcu- 
lation of the EVs. Two simple approximation methods are 
often used (Clemen and Reilly 2001): 
(i) Extended Pearson-Tukey method, or the 3-point approximation, 
is said to be a good choice when the distribution 
is not well known. It works best when PDFs are 
symmetric, though some advocate its use even when they are 
not. Rather than knowing the entire PDF, one needs 
to determine the values of the random variable at only three 
points on the distribution, namely, the median value 
and the 0.05 and 0.95 fractiles. 
Even when the distribution is not well known, the 
values corresponding to these bounds (lower, 
upper and most likely) can be estimated heuristically. 
The attractiveness of this approach is that it has been 
found from experience (even though there is no obvi- 
ous interpretation), that assigning probabilities of 0.63 
for the median and 0.185 for the 0.05 and 0.95 fractiles 
gives surprisingly good results in many cases. Consider 
the CDF of hourly outdoor dry-bulb temperatures for 
Philadelphia, PA for a given year, shown in Fig. 2.6 and 
reproduced in Fig. 12.6. By this approach, the represen- 
tative values of outdoor temperature would be 24.1°F, 



55.0°F and 82.0°F corresponding to 0.05, 0.50 and 0.95 
fractiles. This approach is illustrated in Fig. 12.6b whe- 
re the fan representing the PDF is discretized into three 
branches denoting chance events to which probabilities 
of 0.185, 0.63 and 0.185 are assigned. 
(ii) Bracket median method or the n-point approximation is 
a more complete and rigorous way of discretizing PDFs 
in cases where these are known with some certainty. It 
is simply found from the associated cumulative distri- 
bution function (CDF). The probability range (0-100%) 
is divided into n intervals which are equally likely to 
occur, i.e., each interval with (1/n) probability, and the 
median of each range is taken to be the discrete point 
representing that range. The higher the value of n, the 
better the approximation, but the more tedious the re- 
sulting analysis, though the discretization process itself 
is straightforward. Again considering Fig. 12.6, and for 
the case of n = 5, the representative values are easily 
read off from the y-axis scale from such a plot. The results 
are shown in Table 12.4 and plotted in Fig. 12.6c. 
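
Both discretization methods described above can be sketched in a few lines when the distribution is available as a sample of observations. This is a hedged illustration only: the function names, the linear-interpolation rule used for the fractiles, and the stand-in sample are assumptions, not part of the original text.

```python
def fractile(sorted_values, q):
    """Linearly interpolated fractile q (between 0 and 1) of a sorted sample."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = pos - lower
    return sorted_values[lower] + weight * (sorted_values[upper] - sorted_values[lower])


def extended_pearson_tukey(values):
    """Three-point approximation: 0.05, 0.50 and 0.95 fractiles with
    probabilities 0.185, 0.63 and 0.185."""
    s = sorted(values)
    return [(fractile(s, 0.05), 0.185), (fractile(s, 0.50), 0.63), (fractile(s, 0.95), 0.185)]


def bracket_median(values, n):
    """n-point approximation: the median of each of n equally likely
    probability brackets, each carrying probability 1/n."""
    s = sorted(values)
    return [(fractile(s, (k + 0.5) / n), 1.0 / n) for k in range(n)]


def discrete_mean(points):
    """Mean of a discretized distribution given as (value, probability) pairs."""
    return sum(value * prob for value, prob in points)


# Hypothetical stand-in sample (NOT the Philadelphia temperature data).
sample = list(range(101))
ept = extended_pearson_tukey(sample)
bm = bracket_median(sample, 5)
print("Pearson-Tukey branches:", ept)
print("Bracket median branches:", bm)

# Discrete means implied by the representative values quoted in the text.
mean_ept_text = discrete_mean([(24.1, 0.185), (55.0, 0.63), (82.0, 0.185)])
mean_bm_text = discrete_mean([(28.9, 0.2), (39.9, 0.2), (55.0, 0.2), (66.9, 0.2), (77.0, 0.2)])
print(f"Means from the text values: {mean_ept_text:.1f} F and {mean_bm_text:.1f} F")
```

With hourly temperature records in place of the stand-in sample, the same two functions would return the branch values and probabilities to attach to the chance node of a decision tree.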
(b) Sequential decision-making 

Situations often arise where decisions are to be made sequen- 
tially as against just once. For example, an electric utility 
would be faced with decisions as how best to meet anticipa- 
ted load increases; such decisions may involve subsidizing 
energy conservation measures, planning for new generation 
types and capacity (whether gas turbines, clean coal, solar 
thermal, solar PV, wind, etc.), where to locate them, etc. As discussed 
in Sect. 2.2.4, conditional probabilities represent 
situations with more than one outcome (i.e., compound out- 
comes) that are sequential (successive). The chance result of 
the first stage determines the conditions of the next stage, and 
so on. The outcomes at each stage may be deterministic or 



Fig. 12.6 Discretizing the conti- 
nuous PDF for outdoor dry-bulb 
temperature (°F) of Philadelphia, 
PA (the CDF is copied from 
Fig. 2.6). a The continuous distri- 
bution, b the temperature values 
(and the associated probabilities 
for 0.05, 0.50 and 0.95 fracti- 
les) following the three-branch 
extended Pearson-Tukey method, 
c those following the bracke- 
ted median method with five 
branches 
Table 12.4 Discretizing the CDF of outdoor hourly dry-bulb temperature 
values for Philadelphia, PA shown in Fig. 12.6 into five discrete 
ranges of equal probability 

Bracket median   Median   Representative   Associated 
range            value    value (°F)       probability 
0-0.2            0.1      28.9             0.2 
0.2-0.4          0.3      39.9             0.2 
0.4-0.6          0.5      55.0             0.2 
0.6-0.8          0.7      66.9             0.2 
0.8-1.0          0.9      77.0             0.2 



may be probabilistic. Such a sequence of events can be con- 
veniently broken down and represented by a series of smaller 
simpler problems by resorting to decision tree representation 
of the problem where the random alternatives can be assig- 
ned probabilities of occurrence. Such kinds of problems are 
called dynamic decision situations (note the similarity bet- 
ween such problems and dynamic programming optimization 
problems treated in Sect. 7.10). Uncertain events also impact 
such problems and they can do so at each stage; hence, it is 
necessary to dovetail them into the natural time sequence of 
the decisions. One can distinguish between two cases: 
(i) when one has the luxury to wait till uncertain events 
resolve themselves before taking a decision (say, one 
waits for a few months to determine whether a bad eco- 
nomy is showing undeniable signs of recovery). This 
case is illustrated in Fig. 12.7. 
(ii) when one has to take decisions in the face of uncer- 
tainty. Information known with certainty at a later stage 
may not have been so at an earlier stage when a cour- 



se of action had to be determined; a non-optimal path 
may have been selected. This is similar to evaluating an 
earlier decision with 20/20 hindsight. Subsequent deci- 
sions have to be made which remedy this situation to a 
certain extent while considering future uncertainties as 
well. The following example illustrates this situation. 

Example 12.2.3: Two-stage situation for Example 12.2.2 
Consider the oil company problem treated in Exam- 
ple 12.2.2. This corresponds to a rather simple formulation 
of the decision-making process where a one-time decision 
is to be made. Assume that the first alternative (Alt#1), which 
involved the more expensive first cost upgrade, was selected (see 
Fig. 12.5). After a period of time during which the initial cost 
of $ 320 million was spent in technology upgrades and all 
measures relating to Alt#1 were implemented, the company 
economists revised their probability estimates of the state of 
the economy for the next 6 years to: p(G) = 0.10, p(A) = 0.20 
and p(P) = 0.70. This would result in a much lower EV easily 
determined as: 

Revised EV(Alt#1) = (400 × 0.10) + (160 × 0.20) + (-80 × 0.70) = $ 16 million 

With hindsight, Alt#2 would have been the preferable choice 
since the corresponding revised EV(Alt#2) = $ 56 million. However, 
what is done is done, and the company has to proceed 
forward and make the best of it. It starts with identifying 
alternatives (four are shown in Fig. 12.8) and evaluates them 
before making a second decision. Thus, the decision-making 



Fig. 12.7 Sequential decision- 
making process under the situa- 
tion where one has the luxury to 
wait until uncertain or chance 
events resolve themselves before 
taking a decision 
process now takes place before the uncertain event(s) 
resolve themselves since, otherwise, the penalties may be more 
severe, and some of the alternative avenues may have been 
closed. Businesses of all sorts (and even individuals) expli- 
citly or implicitly follow such a sequential decision-making 
process over time since our very existence is dynamic and 
constantly subject to chance events that are unforeseen and 
unanticipated. ■ 
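
The effect of the revised probability estimates in Example 12.2.3 can be checked numerically. The sketch below reuses the total net savings of Table 12.3; the variable names are illustrative assumptions.

```python
# Total net savings over 6 years (millions of dollars) from Table 12.3,
# ordered as (good, average, poor) state of the economy.
TNS = {"Alt#1": (400, 160, -80), "Alt#2": (200, 110, 20)}

ORIGINAL_P = (0.25, 0.60, 0.15)  # probabilities used in Example 12.2.2
REVISED_P = (0.10, 0.20, 0.70)   # revised estimates of Example 12.2.3


def expected_value(savings, probabilities):
    """Probability-weighted sum of the savings for the three chance events."""
    return sum(s * p for s, p in zip(savings, probabilities))


original_ev = {alt: expected_value(s, ORIGINAL_P) for alt, s in TNS.items()}
revised_ev = {alt: expected_value(s, REVISED_P) for alt, s in TNS.items()}

for alt in TNS:
    print(f"{alt}: original EV = {original_ev[alt]:.0f}, revised EV = {revised_ev[alt]:.0f}")
```

The ranking of the two alternatives reverses under the revised probabilities, which is why Alt#2 looks preferable with hindsight.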



12.2.4 Modeling Outcomes 

The outcomes of each of the discretized events can be quanti- 
fied or modeled in a variety of ways. Expected value or worth 
or expected payoff criterion which represents the net profit 
or actual reward of the corresponding action is perhaps the 
most common, and has been illustrated in Example 12.2.2. 
The maximum payoff criterion is an obvious way of selecting 
the best course of action. This option will also result in the 
least opportunity loss (a term often used in decision-making 
analyses). However, a second criterion to consider is the risk 
associated with either alternative. A measure of risk characterizes 
the variability of the different courses of action, and 
this can be quantified by the standard deviation, or better 
still, by the normalized standard deviation, i.e., the coefficient 
of variation (CV). Thus for A#1, the standard deviation 
of the three outcomes [100, 96, -12] is 63.5 while for A#2 the 
outcomes are [50, 66, 3] with a standard deviation of 32.7. The associated 
CV values are 1.04 and 0.826 respectively. Hence, 
even though A#1 has a higher EV, it is much riskier. A 
risk-averse person may therefore well decide to select A#2 even though 
its EV is lower. Thus, in summary, maximizing the EV and 
minimizing the exposure to risk are two usually conflicting 
objectives; how to trade-off between these objectives is basi- 
cally a problem of decision analysis. 
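
The risk measures quoted above can be reproduced as follows. This is a small sketch with illustrative names; it assumes the sample standard deviation (n - 1 in the denominator), which is the convention that matches the figures in the text.

```python
import statistics

# Expected-value contributions of the three chance events (Table 12.3).
OUTCOMES = {"A#1": [100, 96, -12], "A#2": [50, 66, 3]}


def risk_summary(values):
    """Return the mean, sample standard deviation and coefficient of variation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return mean, std, std / mean


summary = {name: risk_summary(values) for name, values in OUTCOMES.items()}
for name, (mean, std, cv) in summary.items():
    print(f"{name}: std = {std:.1f}, CV = {cv:.3f}")
```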



12.2.5 Modeling Risk Attitudes 

In most cases, decisions are not taken based simply on maxi- 
mizing an objective function or the expected value as discus- 
sed in Sect. 7.2. Except for the simple types of problems, 
alternative or competing decisions have different amounts 
of risk. Different people have different risk attitudes, and 
thus are willing to accept different levels of risk. Consider 
a simple instance when one is faced with a situation with 
two outcomes: (a) the possibility of winning $ 100 with 50% 
probability or getting $ 0 with 50% probability (like a gamble 
determined by flipping a coin), and (b) winning $ 30 with 
100% certainty. The expected value is $ 50 for case 
(a), and if the decision is to be made based on economics 
alone, one would choose outcome (a). A risk-averse person 
Fig. 12.9 Plot illustrating the shape of the utility functions that capture 
the three risk attitudes 



would rather have a sure thing, and may opt for outcome (b). 
Thus, his utility (or the value he places to the payoff) compa- 
red to the outcome or payoff itself can be represented by the 
concave function shown in Fig. 12.9. This is a conservative 
attitude most people tend to have towards risk. However, the- 
re are individuals who are not afraid of taking risks, and the 
utility function of such risk-seeking individuals is illustrated 
by the convex function. A third intermediate category is the 
risk-neutral attitude where the utility is directly proportio- 
nal to the payoff, and is represented by a linear function in 
Fig. 12.9. Another consideration is that the amount of payoff 
may also dictate the outlook towards risk. If $ 1 million 
rather than $ 100 were at stake, the same individual 
who would adopt an aggressive behavior towards risk for 
the smaller amount may become conservative-minded in view 
of the large monetary amount. The above example is another 
manifestation of the law of diminishing returns in economic 
theory where one differentiates between the total returns and 
the marginal returns which is the incremental benefit or va- 
lue perceived by the individual. 
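
The three risk attitudes can be illustrated with the coin-flip gamble described above. The sketch below is only indicative: the square-root, linear and quadratic utility functions are assumed stand-ins for the concave, linear and convex curves of Fig. 12.9, not functions given in the text.

```python
import math

# Option (a): 50/50 gamble between $100 and $0; option (b): a sure $30.
OPTIONS = {"a": [(100.0, 0.5), (0.0, 0.5)], "b": [(30.0, 1.0)]}

# Assumed utility shapes standing in for the three curves of Fig. 12.9.
UTILITIES = {
    "risk-averse": math.sqrt,           # concave
    "risk-neutral": lambda x: x,        # linear
    "risk-seeking": lambda x: x ** 2,   # convex
}


def expected_utility(option, utility):
    """Probability-weighted utility of the payoffs of one option."""
    return sum(prob * utility(payoff) for payoff, prob in option)


choices = {}
for attitude, utility in UTILITIES.items():
    scores = {name: expected_utility(option, utility) for name, option in OPTIONS.items()}
    choices[attitude] = max(scores, key=scores.get)
    print(f"{attitude}: prefers option ({choices[attitude]})")
```

Under these assumed shapes only the concave utility favors the sure payoff, even though its expected monetary value is lower.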

A utility is a numerical rating, specified as a graph, table 
or a mathematical expression assigned to every possible out- 
come a decision maker may be faced with. In a choice bet- 
ween several alternative prospects, the one with the highest 
utility is always preferred. The utility function represents a 
way to map different units (for example, operating cost in 
dollars) into a "utility" number to the individual which cap- 
tures the user's attitude towards risk. Note that this utility 
number does not need to have a monetary interpretation. In 
any case, it assigns numerical values to the potential outco- 
mes of different decisions made in accordance with the deci- 
sion-maker's preference. 
Table 12.5 Traditional present worth analysis 

Type of AC        Initial cost   Annual operating cost   Traditional present worth analysis 
High efficiency   $12,000        $3,000/year             12,000 + 7.72 × 3,000 = $ 35,160 
Normal            $8,000         $4,000/year             8,000 + 7.72 × 4,000 = $ 38,880 



Example 12.2.4: Risk analysis for a homeowner buying an 
air-conditioner 

A homeowner is considering replacing his old air-conditio- 
ner (AC). He has the choice between a more efficient but 
more expensive unit, and one which is cheaper but less ef- 
ficient. Thus, the high efficiency AC costs more initially but 
has lower operating cost. Assume that the discount rate is 5% 
and that the life of both types of AC units is 10 years. 

Table 12.5 assembles the results of the traditional present 
worth analysis where the factor 7.72 is obtained from the 
standard cash flow analysis formula: 



\[ (P/A,\ 0.05,\ 10) = \frac{(1+i)^{n} - 1}{i\,(1+i)^{n}} = \frac{(1.05)^{10} - 1}{0.05\,(1.05)^{10}} = 7.72 \]



where P is the present worth, A is the annuity, i the discount 
rate and n the life (years). 

Traditional economic theory would unambiguously sug- 
gest the high efficiency AC as the better option since its 
present worth is lower (in this case, one wants to minimize 
expenses). However, there are instances when the homeow- 
ner has to include other considerations. Let us introduce an 
element of risk by stating that there is a possibility that he 
will relocate and would have to sell his house. In such a case, 
a simple way is to use the probability of the homeowner re- 
locating as the weight value. For the zero risk case, i.e., he 
will stay on in his house for the foreseeable future, he would 
weigh the up-front cost and the present value of future costs 
equally (i.e., utility weight of 0.5 for each since they have to 
add to 1.0 due to normalization consideration). If he is very 
sure that he will relocate, and given that he must replace his 
AC before putting his house up for sale on the market, he 
would minimize his cost by placing a utility weight of 1.0 to 
the initial cost (implying that this is the only criterion), and 
0.0 for the annual operating costs. If, say, the probability of 
his moving is 30%, the utility weight factor would be 0.7 for 
the initial cost i.e., he is weighing the initial costs more hea- 
vily, while the utility weight for the annual operating costs 
would be (1 -0.7) = 0.3. Thus, present worth for: 

• High efficiency AC = 0.7 × 12,000 + 0.3 × (7.72 × 3,000) = $ 15,348 
• Normal AC = 0.7 × 8,000 + 0.3 × (7.72 × 4,000) = $ 14,864 

Thus, if a risk-averse attitude were to be adopted, the 
choice of the normal AC turns out to be the better option. 



Note that there is no guarantee that this is the better choice 
in the long run. If the homeowner ends up not having to sell 
his house, then he has made a poor choice (in fact, he is very 
likely to end up paying more than that calculated above since 
the fact that electricity rates are bound to increase over time 
has been overlooked in the analysis). ■ 
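
The calculations of Example 12.2.4 can be sketched as below. The function names are illustrative assumptions; the utility weight of 0.7 on the initial cost is taken directly from the example. The script uses the unrounded annuity factor, so its results differ by a few dollars from values computed with the rounded factor 7.72.

```python
def present_worth_factor(i, n):
    """Uniform-series present worth factor (P/A, i, n)."""
    return ((1 + i) ** n - 1) / (i * (1 + i) ** n)


def traditional_present_worth(initial_cost, annual_cost, i=0.05, n=10):
    """Initial cost plus the present worth of the annual operating costs."""
    return initial_cost + present_worth_factor(i, n) * annual_cost


def weighted_present_worth(initial_cost, annual_cost, w_initial, i=0.05, n=10):
    """Weigh the initial cost by w_initial and future costs by (1 - w_initial)."""
    future = present_worth_factor(i, n) * annual_cost
    return w_initial * initial_cost + (1 - w_initial) * future


UNITS = {"high efficiency": (12_000, 3_000), "normal": (8_000, 4_000)}

factor = present_worth_factor(0.05, 10)
traditional = {k: traditional_present_worth(*v) for k, v in UNITS.items()}
weighted = {k: weighted_present_worth(*v, w_initial=0.7) for k, v in UNITS.items()}

print(f"(P/A, 0.05, 10) = {factor:.2f}")
for name in UNITS:
    print(f"{name}: traditional = ${traditional[name]:,.0f}, weighted = ${weighted[name]:,.0f}")
```

Varying w_initial between 0.5 and 1.0 shows the weight at which the preferred unit switches from the high efficiency AC to the normal one.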

The selection of the utility weights was somewhat ob- 
vious in the previous example since they were tied to the 
probability of the homeowner relocating to another location. 
The weights are to be chosen on some rational basis, say in 
terms of their relative importance to the analyst. A function 
commonly used for modeling utility is the exponential func- 
tion which models a constant risk attitude mind-set: 



\[ U_{\exp}(x) = a - b \cdot \exp\left( -\frac{x - x_{\min}}{R} \right) \tag{12.1} \]

where R is a parameter characterizing the person's risk tolerance, 
x_min is the minimum monetary value below which 
the individual would not entertain the project, and the coefficients 
"a" and "b" are normalization factors. Large values 
of R, corresponding to higher risk tolerance, would make 
the exponential function flatter, while smaller values would 
make it more concave which is reflective of more risk-aver- 
se behavior. The above function is a general form and can 
be simplified for certain cases; its usefulness lies in that it 
allows a versatile set of utility functions to be modeled. The 
following example illustrates these notions more clearly. 

Example 12.2.5: Generating utility functions
Consider the case where a person wishes to invest in a venture. Assuming
that his minimum cut-off threshold is \(x_{\min}\) = $500, the utility
functions following Eq. 12.1 for two different risk tolerances,
R = $5,000 and R = $10,000, are to be generated. One needs to determine
the numerical values of the coefficients a and b by selecting the limits
of the function \(U_{\exp}(x)\). A common (though not imperative) choice
for the normalization range is 0-1, corresponding to the minimum and
maximum positive payoffs. Then, for R = 5,000, the lower limit set at
\(x = x_{\min} = 500\) yields
\(0 = a - b \cdot \exp\left(-\frac{500 - 500}{5{,}000}\right)\), from
where a = b. Thus, in this case, the utility function could have been
expressed without the coefficient "b". For the upper limit, one takes
\(x = x_{\max} = x_{\min} + R\), yielding
\(1 = a - a \cdot \exp\left(-\frac{5{,}500 - 500}{5{,}000}\right)\), from
where a = 1.582 = b. Similarly, for R = 10,000, one also gets
a = 1.582 = b. The corresponding utility functions for a range of values
for the variable x extending from -3,000 to 5,000 are plotted in
Fig. 12.10a. Note that all payoff values below the minimum have negative
utility values (i.e., a loss), while the utility function is steeper for
the lower value of R. Thus, someone with a lower risk tolerance would
assign a higher utility value for a given value of x, which is
characteristic of a more risk-averse outlook.
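The normalization steps of this example can be checked numerically. The sketch below is only illustrative: it assumes the 0-1 normalization described above (zero utility at x_min, unit utility at x_min + R), and the function names are hypothetical rather than taken from the text.

```python
import math


def exp_utility_coefficient():
    """Coefficient a (= b) such that U(x_min) = 0 and U(x_min + R) = 1."""
    return 1.0 / (1.0 - math.exp(-1.0))


def exp_utility(x, x_min, R):
    """Exponential utility of Eq. 12.1 with a = b (0-1 normalization assumed)."""
    a = exp_utility_coefficient()
    return a - a * math.exp(-(x - x_min) / R)


if __name__ == "__main__":
    print("a = b =", round(exp_utility_coefficient(), 3))
    for R in (5_000, 10_000):
        for x in (-3_000, 500, 3_000, 5_000):
            print(f"R = {R:>6}, x = {x:>6}: U = {exp_utility(x, 500, R):.4f}")
```

Because the upper limit is tied to R, the coefficient does not depend on R, which is why both risk tolerances lead to the same value of a.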
Fig. 12.10 a Utility functions of Example 12.2.5 (\(x_{\min}\) = 500).
b Comparison of the exponential and logarithmic utility functions using
values from case (a)

The exponential utility function is used to characterize constant risk
aversion, i.e., the risk tolerance is the same irrespective of the amount
under risk. Another common attitude towards risk displayed by individuals
is decreasing risk aversion, which reflects the attitude that people's
aversion to risk decreases (i.e., their risk tolerance increases) as the
payoff increases. This is expressed as:

(/iogW=« + log^^^|^ (12.2) 



Figure 12.10b provides a comparative evaluation of the two models on a
common graph rescaled such that they have the same values at both
extremities as shown, namely \(x = x_{\min} = 500\) and
\(x = (R + x_{\min}) = 5{,}500\). One notes that, as expected, the
logarithmic model has slightly higher utility values for higher values of
the wealth variable, indicating that it is representative of a more
risk-averse outlook. In any case, both models have the same general shape
and are quite close to each other. ■

Example 12.2.6: Reanalyze the situation described in Example 12.2.2 using
the exponential utility function given by Eq. 12.1. Normalize the function
as illustrated in Example 12.2.5 such that \(x_{\min}\) = $2 M, and
consider two cases for the risk tolerance: R = $100 M and R = $200 M.
Table 12.6 Expected values for the two cases being analyzed (Example 12.2.6)
Case 1: x_min = $2 M, R = $100 M. Case 2: x_min = $2 M, R = $200 M.

| Alternative | Chance event probability p(x) | Total net savings (TNS) over 6 years (millions $) | Case 1 U(x) | Case 1 U(x)*p(x) | Case 2 U(x) | Case 2 U(x)*p(x) |
|-------------|-------------------------------|---------------------------------------------------|-------------|------------------|-------------|------------------|
| 1           | 0.25                          | -320 + (6)(120) = 400                             | 1.5524      | 0.3881           | 1.3657      | 0.3414           |
|             | 0.60                          | -320 + (6)(80) = 160                              | 1.2561      | 0.7537           | 0.8640      | 0.5184           |
|             | 0.15                          | -320 + (6)(40) = -80                              | -2.0099     | -0.3015          | -0.8018     | -0.1203          |
| Total       |                               |                                                   |             | 0.8403           |             | 0.7396           |
| 2           | 0.25                          | -100 + (6)(50) = 200                              | 1.3636      | 0.3409           | 0.9942      | 0.2486           |
|             | 0.60                          | -100 + (6)(35) = 110                              | 1.0448      | 0.6269           | 0.6601      | 0.3961           |
|             | 0.15                          | -100 + (6)(20) = 20                               | 0.2606      | 0.0391           | 0.1362      | 0.0204           |
| Total       |                               |                                                   |             | 1.0069           |             | 0.6650           |



Following Example 12.2.5, the coefficients in Eq. 12.1 are still found to
be a = b = 1.582. However, the utility values are not between 0 and 1. The
results of this example are tabulated in Table 12.6.

Alt#2 is better under Case 1, which corresponds to the more risk-averse
situation (i.e., under the lower value of R = $100 M), while Alt#1 is
preferable in the other case. Though one may not have expected a reversal
in the preferable alternative, the trend is clearly as expected. When one
is more risk-averse, the possibility of losing money outweighs the
possibility of making more money.

This example is meant to illustrate that the selection of the risk
tolerance value R is a critical factor during the decision-making process.
One may well ask, given that it may be difficult to assign a
representative value of R, whether there is an alternative manner of
selecting R. A break-even approach could be adopted. Instead of selecting
a value of R at the onset, the problem could have been framed as one of
determining that value of R where both alternatives become equal. It is
left to the reader to determine that this break-even value is R = $145 M.
It is now easier for the decision-maker to gauge whether his risk aversion
threshold is higher or lower than this break-even value, and thereby
decide on the alternative to pursue. ■
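The expected utilities of Table 12.6 and the break-even risk tolerance can be reproduced with a few lines of code. This is a minimal sketch under the assumptions of the example (Eq. 12.1 with a = b from the 0-1 normalization and x_min = $2 M); the simple bisection search and all function names are illustrative choices, not something prescribed by the text.

```python
import math

A = 1.0 / (1.0 - math.exp(-1.0))  # a = b, approximately 1.582
X_MIN = 2.0                       # minimum threshold, $ million

# (probability, total net savings in $ million) as listed in Table 12.6
ALT1 = [(0.25, 400.0), (0.60, 160.0), (0.15, -80.0)]
ALT2 = [(0.25, 200.0), (0.60, 110.0), (0.15, 20.0)]


def utility(x, R, x_min=X_MIN):
    """Exponential utility of Eq. 12.1 with a = b."""
    return A * (1.0 - math.exp(-(x - x_min) / R))


def expected_utility(alternative, R):
    """Probability-weighted sum of the utilities of the chance events."""
    return sum(p * utility(x, R) for p, x in alternative)


def break_even_R(lo=100.0, hi=200.0, tol=1e-6):
    """Bisection on EU(Alt#1) - EU(Alt#2); assumes a sign change in [lo, hi]."""
    def diff(R):
        return expected_utility(ALT1, R) - expected_utility(ALT2, R)

    f_lo = diff(lo)
    if f_lo * diff(hi) > 0:
        raise ValueError("no sign change in the bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = diff(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for R in (100.0, 200.0):
        print(R, round(expected_utility(ALT1, R), 4),
              round(expected_utility(ALT2, R), 4))
    print("break-even R ($ million):", round(break_even_R(), 1))
```

The bracket of $100 M to $200 M is chosen because the preferred alternative switches between the two cases analyzed.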



12.2.6 Modeling Multi-attributes or Multiple Objectives

Optimization problems requiring the simultaneous consideration of several
objectives or criteria or goals or attributes fall under multi-attribute
or multiple objectives optimization. Many of these may be in conflict with
each other, and so one cannot realize all of the goals exactly. One
example involves meeting the twin goals of an investor who desires a stock
with maximum return and with minimum risk; these are generally
incompatible, and therefore unachievable. A second example of multiple
conflicting objectives can be found in organizations that want to:
maximize profits, increase sales, increase worker wages, upgrade product
quality and reduce product cost, while paying larger dividends to
stockholders and retaining earnings for growth. One cannot identify a
solution where "all" objectives are met optimally, and so some sort of
"compromise" has to be reached. Yet another example involves public policy
decisions which minimize risks of economic downturn while replacing fossil
fuel-based electric power production (currently, the worldwide fraction is
around 85%) by renewable energy sources so that global warming and human
health problems are mitigated. The analysis of situations involving
multi-attribute optimization is a complex one, and is an area of active
research. For the purpose of our introductory presentation, methods to
evaluate multiple alternatives can be grouped into two general classes:

(a) Single dimensional methods 

The objective function is cast in terms of a common metric, such as the
cost of mitigation or avoidance. To use this single criterion method,
though, the utility of each attribute must be independent of all the
others, while any type of function for the individual utility functions
can be assumed. Suppose one has an outcome with a number of attributes,
\((x_1, \ldots, x_m)\). Utility independence is present if the utility of
one attribute, say \(U_1(x_1)\), does not depend on the different levels
of the other attributes. Say, one has individual utility functions
\(U_1(x_1), \ldots, U_m(x_m)\) for m different attributes \(x_1\) through
\(x_m\). One simple approach is to use normalized rank or non-dimensional
scaling wherein each attribute is weighted by a factor \(k_i\) determined
from:

\[ k_i = \frac{\text{outcome}(i) - \text{worst outcome}}{\text{best outcome} - \text{worst outcome}} \tag{12.3} \]

Thus, \(k_i\) values for each attribute \(U_i\) can be determined with
values between 0 (the worst) and 1 (the best). Note that this results in
the curve for \(U_i(x_i)\) being the same no matter how the other
attributes change.

The direct additive weighting method is a popular subset of the single
criteria weighing method because of its simplicity. It assigns compatible
individual weights to each utility of different objectives or attributes,
and then combines them as a weighted sum into a single objective function.
The additive utility function is simply a weighted average of these
different utility functions. For an outcome that has levels
\((x_1, \ldots, x_m)\) on the m objectives or attributes, one can define
re-normalized weights \(k'_i = k_i / \sum_{i=1}^{m} k_i\), and calculate
the utility of this outcome as (Clemen and Reilly 2001):

\[ U(x_1, \ldots, x_m) = k'_1 U_1(x_1) + \cdots + k'_m U_m(x_m) = \sum_{i=1}^{m} k'_i U_i(x_i) \tag{12.4} \]

where \(k'_1 + k'_2 + \cdots + k'_m = 1\).

Based on the above model, the utility of the outcome can be determined for
each alternative, and the one with the greatest expected utility is the
best choice. Note that the discounted cash flow or net present value
analysis is actually a simple method of reconciling conflicting objectives
where all future cash flows are "converted" to present value by way of
discounting the time value of money. This is an example of an additive
utility function with equal weights (see Example 12.2.4). Note that the
above discussion applies to multi-attributes which do not interact with
each other, i.e., are mutually independent. The interested reader can
refer to texts such as Clemen and Reilly (2001) or to Haimes (2004) for
modeling utility functions with interacting attributes. An illustrative
case study of the additive weighting approach is given in Sect. 12.4.2.
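The scaling of Eq. 12.3 and the weighted sum of Eq. 12.4 amount to very little computation, as the following sketch suggests. The function names and the numbers in the usage lines are hypothetical and serve only to illustrate the mechanics for mutually independent attributes.

```python
def normalized_score(outcome, worst, best):
    """Eq. 12.3: 0 for the worst outcome, 1 for the best outcome."""
    return (outcome - worst) / (best - worst)


def renormalize(weights):
    """Rescale raw weights k_i so that the re-normalized weights sum to 1."""
    total = sum(weights)
    return [k / total for k in weights]


def additive_utility(utilities, weights):
    """Eq. 12.4: weighted sum of the individual attribute utilities."""
    k_prime = renormalize(weights)
    return sum(k * u for k, u in zip(k_prime, utilities))


if __name__ == "__main__":
    # Hypothetical attribute: a cost where 12,000 is worst and 8,000 is best
    u_cost = normalized_score(10_000, worst=12_000, best=8_000)
    # Hypothetical second attribute already expressed as a utility of 0.8
    print(additive_utility([u_cost, 0.8], weights=[3, 1]))
```

Passing worst and best explicitly lets the same scaling handle attributes where a lower value is preferred, such as cost.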

(b) Multi-dimensional methods 

Most practical situations involve considering multiple attributes, and
decision-making under such situations is very widespread. However, in most
cases a heuristic approach is adopted, often involving a quantitative
matrix method. Even college freshmen in many disciplines are taught this
method when basic concepts of evaluating different design choices are
being introduced. The following example illustrates this approach.

Example 12.2.7: Ascertaining risk for buildings
Several semi-empirical methods to quantify the relative risk of building
occupants to the hazard of a chemical-biological-radiological (CBR) attack
have been proposed (see Sect. 12.3.4). A guidance document, ASHRAE (2003),
describes a multi-step risk analysis approach (see Sect. 12.3.1) which
first involves defining an organization's (or a building's) exposure
level. This approach assigns (in consultation with the building owner and
occupants) levels to various attributes such as number of occupants,
number of threats received towards the organization, critical nature of
the building function, time to recover to an 80% functioning level after
an incident occurs, monetary value of the building (plus product,
equipment, personnel, information contained, ...), and the ease of public
access to the building.

ASHRAE (2003) gives an example of an office building near a small rural
town with 50 employees and typically five visitors at any time (refer to
Table 12.7). The value of the building is estimated to be $3 million, and
the building function is low in criticality with only two threats received
last year. Access to the building is restricted because card readers are
required for entry. The expected recovery time is estimated to be
3 business days. The exposure level matrix (Table 12.7) shows the
categories and attribute range levels estimated by the building
management. Next, the levels specific to this circumstance are determined,
and these are entered in the row corresponding to "Score". For example,
the level for number of occupants is two, and so on. The third step
involves selecting weighting factors for each of the attributes (which
need to add up to unity or 100%), after which the weighted score for each
factor is calculated. Finally, the sum of the calculated scores is
computed, which provides a single metric of the building/organization's
exposure level (= 1.8 in this case). No action is probably warranted in
this case. ■

It is obvious from the above example that the matrix met- 
hod is highly subjective and empirical; in all fairness, it is 



Table 12.7 Sample exposure level matrix (level 1-5) as suggested in ASHRAE (2003) for different hazards (Example 12.2.7)

| Category (Level)      | Number of people | Received threats | Critical nature of building | Recovery time | Dollar value of facility | Ease of public access |
|-----------------------|------------------|------------------|-----------------------------|---------------|--------------------------|-----------------------|
| 1                     | 0-10             | 0-1              | Low                         | < 2 days      | < $2 million             | Low                   |
| 2                     | 11-60            | 2-4              | Low-Medium                  | 2-14 days     | $2-$10 M                 | Low-Medium            |
| 3                     | 61-120           | 5-8              | Medium                      | 14-90 days    | $10-$50 M                | Medium                |
| 4                     | 121-1,500        | 9-12             | Medium-High                 | 3-6 months    | $50-$100 M               | Medium-High           |
| 5                     | > 1,500          | > 12             | High                        | > 6 months    | > $100 M                 | High                  |
| Determined score (a)  | 2                | 2                | 1                           | 2             | 2                        | 3                     |
| Weighting factor      | 20%              | 20%              | 30%                         | 20%           | 10%                      | 10%                   |
| Calculated score (a)  | 0.4              | 0.4              | 0.3                         | 0.4           | 0.2                      | 0.3                   |
| Exposure level (a)    | 1.8 (sum of calculated scores) |    |                             |               |                          |                       |

(a) These values are specific to the building/organization being assessed in this example.
difficult, if not impossible, to quantify the various attributes in an
analytical manner without resorting to very complex analytical methods
which have their own intrinsic limitations. A more complete illustrative
example of multi-attribute decision-making is presented in Sect. 12.4.2.
The more rigorous manner of tackling multi-attribute problems is to
analyze the attributes in terms of their original metrics. In such cases,
decision analysis requires the determination of multi-attribute utility
functions instead of a single utility function. These multi-attribute
functions would evaluate the joint utility values of several measures of
effectiveness toward fulfilling the various objectives. However,
manipulation of these multi-attribute utility functions can be extremely
complex, depending on the subjectivity of the metrics, the size of the
problem and the degree of dependence among the various objectives. Hence,
the suggested procedure is to use multi-dimensional analysis for initial
screening of alternatives, reduce the dimension of the problem, and then
resort to single-dimension methods for final decision-making.

Several techniques have been proposed for dealing with multi-dimensional
problems. First, one needs to distinguish between objectives which are
separate, and those that interact, i.e., are somewhat related (the latter
is much harder to analyze). For the former instance, one modeling
approach, called non-compensatory modeling, is to judge the alternatives
on an attribute-by-attribute basis. Sometimes, one alternative is better
than all others in all attributes. This is a case of dominance, and there
is no ambiguity as to which is the best option. Unfortunately, this occurs
very rarely.

A second approach is to use the method of feasible ran- 
ges, where one establishes minimum and maximum accep- 
tance values for each attribute, and then rejects alternatives 
whose attributes fall outside these limits. This approach is 
based on the satisficing principle which advocates that sa- 
tisfactory rather than optimal performance or design is good 
enough for practical decision-making. 
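The dominance check and the method of feasible ranges described above can both be expressed as simple screening functions. The sketch below assumes that every attribute has been scored so that higher is better, and it uses the common convention that a dominant alternative is at least as good in every attribute and strictly better in at least one; the alternatives and limits shown are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def dominant_alternative(alternatives):
    """Name of the alternative that dominates all others, or None."""
    for name, scores in alternatives.items():
        others = [s for n, s in alternatives.items() if n != name]
        if all(dominates(scores, other) for other in others):
            return name
    return None


def within_feasible_ranges(scores, ranges):
    """True if every attribute lies inside its (minimum, maximum) limits."""
    return all(lo <= s <= hi for s, (lo, hi) in zip(scores, ranges))


def screen(alternatives, ranges):
    """Keep only the alternatives that satisfy all acceptance limits."""
    return [name for name, scores in alternatives.items()
            if within_feasible_ranges(scores, ranges)]


if __name__ == "__main__":
    # Hypothetical scores on three attributes, each scaled to 0-1
    alternatives = {"A": (0.9, 0.7, 0.8), "B": (0.6, 0.5, 0.8), "C": (0.4, 0.9, 0.3)}
    limits = [(0.5, 1.0), (0.4, 1.0), (0.5, 1.0)]
    print(dominant_alternative(alternatives))  # no single dominant option here
    print(screen(alternatives, limits))        # satisficing short-list
```

Screening of this kind only reduces the set of candidates; a final choice among the survivors still calls for one of the other methods.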

A third approach is to adopt a technique known as goal programming, which
is an extension of linear programming (Hillier and Lieberman 2001). The
objectives are rank ordered in priority and optimized sequentially. First,
specific numerical goals for each objective are established, an objective
function for each objective is formulated, and an optimal solution is
sought that minimizes the weighted sum of the deviations of the objective
functions from these respective target goals according to one of the
following goals: a lower, one-sided goal which sets a lower limit that
should be met; an upper, one-sided goal which sets an upper limit that
should not be exceeded; a two-sided goal where both limits are set. Goal
programming does not attempt to maximize or minimize a single objective
function as does the linear programming model. Rather, it seeks to
minimize the deviations among the desired goals and the actual results
according to the priorities assigned. Note the similarity of this approach
to the penalty function method (Sect. 7.3.4). The interested reader can
refer to Haimes (2004) and other texts for an in-depth discussion of
multi-dimensional methods.



12.2.7 Analysis of Low Epistemic but High Aleatory Problems

Let us now deal with situations where the structure of the 
problem is well defined, but aleatory uncertainty is present 
(see Fig. 12.1). For example, the random events impacting 
the system under study are characterized by a probability 
distribution and have inherent uncertainties large enough to 
need explicit consideration. Monte Carlo analysis (MCA) 
methods were first introduced for calculating uncertainty 
propagation in data (Sect. 3.7.3). In general, MCA is a met- 
hod of analysis whereby the behavior of processes or sys- 
tems subject to chance events (i.e., stochastic) can be better 
ascertained by artificially recreating the probabilistic occur- 
rence of these chance events a large number of times. MCA 
methods are also widely used in decision-making problems, 
and serve to complement probabilistic analytical methods di- 
scussed earlier. Recall that, in essence, MCA is a "numerical 
process of repeatedly calculating a mathematical or empiri- 
cal operator in which the variables within the operator are 
random or contain uncertainty with prescribed probability 
distributions" (Ang and Tang 2007). The following example 
will serve to illustrate the application of MCA to decision- 
making. 

Example 12.2.8: Monte Carlo analysis to evaluate alterna- 
tives for the oil company example 

Consider Example 12.2.2 dealing with the analysis of two 
risky alternatives faced by an oil company. The risk is due 
to the uncertainty in how the economy will perform in the 
future. Only three states were considered: good, average 
and poor, and discrete probabilities were assigned to them. 
The subsequent calculation of the expected value EV was 
straightforward, and did not have any uncertainty associated 
with it (hence, this is a low or no epistemic uncertainty si- 
tuation). The probabilities associated with the future state of 
the economy are very likely to be uncertain, but the previous 
analysis did not explicitly recognize this variability. The ana- 
lytic treatment of such variability in the PDF is what makes 
this problem a high aleatory one. 

For this example, it was assumed that the probabilities p(Good) and
p(Average) are normally distributed with a Coefficient of Variation (CV)
of 10% (i.e., the standard deviation is 10% of the mean), and consequently
p(Poor) = 1 - p(Good) - p(Average). A software program was used to perform
an MCA with 1,000 runs. The results are plotted in Fig. 12.11 both as a
histogram and as a cumulative distribution (which can be depicted as
either descending or ascending). Often, MCA results are plotted thus since
these two figures provide useful complementary information regarding the
shape of the distribution as well as values of the EV random variable at
different percentiles. The 5% and 95% percentiles as well as the mean
values are indicated. Thus, for Alt#1, the mean value of EV is 184.0 with
the 90% uncertainty range being {152.8, 214.3}. One notes that there is no
overlap between the 5% value for Alt#1 and the 95% value for Alt#2, and so
in this case, Alt#1 is clearly the better choice despite the variability
in the probability distributions of the state of the economy. With higher
variability, there would obviously be greater overlap between the two
distributions. ■

Fig. 12.11 Results of the Monte Carlo analysis with 1,000 runs
(Example 12.2.8). a Histogram and descending cumulative distribution for
the Expected Value (EV) for Alt#1. b Histogram and ascending cumulative
distribution for the Expected Value (EV) for Alt#2
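A Monte Carlo run of this kind can be sketched in a few lines. The code below is illustrative only: it takes the mean probabilities and total net savings from Table 12.6, draws p(Good) and p(Average) independently from normal distributions with a CV of 10%, and applies no truncation to the sampled probabilities, which is an assumption since the text does not state how the software handled this. The seed and function names are arbitrary.

```python
import math
import random

P_GOOD_MEAN, P_AVG_MEAN = 0.25, 0.60  # mean chance-event probabilities
CV = 0.10                             # standard deviation / mean
TNS = {                               # total net savings, $ million
    "Alt#1": (400.0, 160.0, -80.0),   # good, average, poor economy
    "Alt#2": (200.0, 110.0, 20.0),
}


def simulate_ev(n_runs=1000, seed=0):
    """Return a list of simulated expected values for each alternative."""
    rng = random.Random(seed)
    ev = {name: [] for name in TNS}
    for _ in range(n_runs):
        p_good = rng.gauss(P_GOOD_MEAN, CV * P_GOOD_MEAN)
        p_avg = rng.gauss(P_AVG_MEAN, CV * P_AVG_MEAN)
        p_poor = 1.0 - p_good - p_avg  # assumption: not truncated at zero
        for name, (good, avg, poor) in TNS.items():
            ev[name].append(p_good * good + p_avg * avg + p_poor * poor)
    return ev


def percentile(values, q):
    """Nearest-rank percentile for q between 0 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    results = simulate_ev()
    for name, values in results.items():
        mean = sum(values) / len(values)
        print(name, round(mean, 1),
              round(percentile(values, 5), 1), round(percentile(values, 95), 1))
```

Individual percentile values will differ slightly from those quoted in the example because they depend on the random number stream used.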



12.2.8 Value of Perfect Information

The value of collecting additional information to reduce the 
risk, capturing heuristic knowledge or combining subjecti- 
ve preferences into the mathematical structure are intrinsic 
aspects of problems involving decision-making. As stated 
earlier, inverse models can be used to make predictions 
about system behavior. These have inherent uncertainties 



(which may be large or small depending on the specific si- 
tuation), and adopting a certain inverse model over potential 
competing ones involves the consideration of risk analysis 
and decision-making tools. The issue of improving model 
structure identification by way of careful experimental de- 
sign has been covered in Chap. 6, while the issue of heuristic 
information from experts merits more discussion. This sec- 
tion will present a method by which one can gauge the value 
of additional information, and thereby ascertain whether the 
cost of doing so is justified for the situation at hand. 

An expert's information is said to be perfect if it is always correct
(Clemen and Reilly 2001). Then, there is no doubt about how future events
will unfold, and so decisions can be made without any uncertainty. The
after-the-fact situation is of little value since the decision has to be
made earlier. The real value of perfect information is associated with the
benefit which it can provide before the fact. The following example will
illustrate how the concept of value of perfect information can be used to
determine an upper bound of the expected benefit, and thereby provide a
manner of inferring the maximum additional monetary amount one can spend
for obtaining this information.
Example 12.2.9: Value of perfect information for the oil company

Consider Example 12.2.2 where an oil company has to decide between two
alternatives that are impacted depending on how the economy (and, hence,
oil demand and sales) is likely to fare in the future. Perfect information
would mean that the future is known without uncertainty. One way of
stating the value of such information compared to selecting Alt#1 is to
say that the total net savings (TNS) would be $400 M if the economy turns
out to be good, $160 M if it turns out to be average and a loss of $80 M
if poor (see Table 12.6). However, this line of enquiry is of little value
since, though it provides measures of opportunity loss, it does not
influence any phase of the current decision-making process. The better
manner of quantifying the benefit of perfect information is to ascribe an
expected value to it, called EVPI (expected value of perfect information),
under the assumption that any of the three chance events (or states of
nature) can occur.

The decision tree of Fig. 12.5 is simply redrawn as shown in Fig. 12.12
where the chance node with its three branches is the starting point, and
each branch then subdivides into the two alternatives. Then, perfect
information would have led to selecting Alt#1 during the good and average
economy states (since their TNS values are higher), and Alt#2 during the
poor economy state. The EVPI is given by:

EVPI = 400 × 0.25 + 160 × 0.6 + 20 × 0.15 = $199 M.

If the decision maker is leaning towards Alt#1 (since it has the higher EV
as per Table 12.3), then the additional benefit of perfect information
would be ($199 M - $184 M) = $15 M. Hence, the option of consulting
experts in order to reduce the uncertainty in predicting the future state
of the economy should cost no more than $15 M (and, preferably, much less
than this amount). ■

Fig. 12.12 Decision tree representation for determining expected value of
perfect information (EVPI). TNS is the total net savings (Example 12.2.9)
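The arithmetic of this example is easily generalized to any number of states and alternatives. The following sketch follows the labels used in the example (the $199 M figure is the expected value when the best alternative is picked in each state, and the $15 M figure is the additional benefit over the best alternative chosen without that information); the dictionary layout and function names are illustrative assumptions.

```python
# Chance-event probabilities and total net savings ($ million) from Table 12.6
P = {"good": 0.25, "average": 0.60, "poor": 0.15}
TNS = {
    "Alt#1": {"good": 400, "average": 160, "poor": -80},
    "Alt#2": {"good": 200, "average": 110, "poor": 20},
}


def expected_value(alternative):
    """Expected TNS of one alternative without any additional information."""
    return sum(P[s] * TNS[alternative][s] for s in P)


def ev_with_perfect_information():
    """Expected TNS when the best alternative is chosen in every state."""
    return sum(P[s] * max(TNS[a][s] for a in TNS) for s in P)


def additional_benefit():
    """Upper bound on what the additional information is worth."""
    return ev_with_perfect_information() - max(expected_value(a) for a in TNS)


if __name__ == "__main__":
    print(round(ev_with_perfect_information(), 1))
    print(round(additional_benefit(), 1))
```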



12.2.9 Bayesian Updating Using Sample Information

The advantage of adopting a Bayesian approach whereby 
prior information about the population can be used to en- 
hance the value of information from a collected sample has 
been discussed in Sect. 2.5 (from the point of view of sam- 
pling) and in Sect. 4.6 (for making statistical inferences). 
Sect. 2.5.1 presented relevant mathematical equations while 
Example 2.5.3 illustrated this process with an example whe- 
re the prior PDF of defective pile foundations in a civil en- 
gineering project is modified after another pile is tested and 
found defective. If each defective proportion has a remedia- 
tion cost associated with it, it is simple to correct the origi- 
nally expected EV so as to be more reflective of the posterior 
situation. Thus, the Bayesian approach has obvious applica- 
tion in decision-making. Here, the prior information regar- 
ding originally assumed chance events (or states of nature) 
can be revised with information gathered from a sample. 
This may take the form of advice or opinion from an expert 
or even a sort of action such as polling or testing. A company 
wishing to launch a new product can perform a limited mar- 
ket release, gather information from a survey, and reassess 
their original sales estimates or even their entire marketing 
strategy based on this sample information. The following 
example illustrates the instance where prior probabilities of 
chance events are revised into posterior probabilities in light 
of sample information. 

Example 12.2.10: Benefit of Bayesian methods for evalua- 
ting real-time monitoring instruments in homes 
Altering homeowner behavior has been recognized as a key factor in
reducing energy use in residences. This can be done by educating the
homeowner on the electricity draw of various ubiquitous equipment (such as
television sets, refrigerators, freezers, dishwashers and washers/dryers),
the implications of thermostat setup and setdown, switching off unneeded
lights, and so on. The benefits of this approach can
be enhanced by a portable product consisting of a wireless 
monitoring cum display device called The Energy Detecti- 
ve (TED) which allows instantaneous energy use of vari- 
ous end-use devices to be displayed in real-time on a small 
screen along with an estimate of the electricity cost to-date 
into the billing period, as well as projections of what the 
next electric bill amount is likely to be. Such feed-back in- 
formation is said to result in enhanced energy savings to the 
household as a result of increased homeowner awareness 
and intervention. 






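Table 12.8 Summary table of the Bayesian calculation procedure for Example 12.2.10

| Chance events | (a) Energy reduction (f_i) | (b) Prior probability p_i | (c) Conditional probability L_i | (d) = (b) × (c) | (e) Posterior probability p'_i = (d_i)/sum(d) |
|---------------|----------------------------|---------------------------|---------------------------------|-----------------|-----------------------------------------------|
| S1            | 0-2% (1%)                  | 0.10                      | 0.1796                          | 0.01796         | 0.088                                         |
| S2            | 2-4% (3%)                  | 0.20                      | 0.2182                          | 0.04364         | 0.214                                         |
| S3            | 4-6% (5%)                  | 0.30                      | 0.1916                          | 0.05748         | 0.282                                         |
| S4            | 6-8% (7%)                  | 0.30                      | 0.1916                          | 0.05748         | 0.282                                         |
| S5            | 8-10% (9%)                 | 0.10                      | 0.2702                          | 0.02702         | 0.133                                         |
| Total         |                            | 1.00                      |                                 | 0.20358         | 1.000                                         |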



A government agency is planning to implement this concept in the context
of a major "greening" project involving several thousands of homeowners in
a certain neighborhood of a city. Consultants to this project estimate the
energy reduction potential \(f_i\) discretized into 5 bins (to be
interpreted as chance events or states of nature) \(S_1\) through \(S_5\)
along with the associated probability \(p_i\) of achieving them (shown
under the first three columns of Table 12.8).

The prior expected average energy reduction throughout the population,
i.e., the entire neighborhood, is

\[ \sum_{i=1}^{5} f_i \cdot p_i = (0.01)(0.1) + (0.03)(0.2) + (0.05)(0.3) + (0.07)(0.3) + (0.09)(0.1) = 0.052 \text{ or } 5.2\%. \]

The agency is concerned whether the overall project target of 5% average
savings will be achieved, and decides to adopt a Bayesian approach before
committing to the entire project. A sample of 20 households is selected at
random, TED devices are installed with the homeowners educated adequately,
and energy use before and after are monitored for, say, 3 months each. The
following sample information R was computed in terms of the energy savings
for each of the five chance events:

• S1: 3 homes (i.e., in 3 homes, the energy savings were between 0-2%),

• S2: 4 homes, S3: 6 homes, S4: 6 homes, S5: 1 home.

Recall from Chap. 2 that the probability of event B with event A having
occurred can be determined from its "flip" conditional probability:

\[ p(B \mid A) = \frac{p(A \mid B) \cdot p(B)}{p(A \mid B) \cdot p(B) + p(A \mid \bar{B}) \cdot p(\bar{B})} \tag{12.5} \]

In the context of this example, the conditional probabilities or the
likelihood function needs to be computed. Assuming the energy reduction
percentage x to be a random binomial variable, the conditional
probabilities or the likelihood function for this sample are calculated as
shown below (listed under the fourth column in the table):

\[ p(R_1 \mid S_1) = p(x = 3 \mid p = 0.1) = \binom{20}{3} (0.1)^{3} (0.9)^{17} = 0.1796 \]

\[ p(R_2 \mid S_2) = p(x = 4 \mid p = 0.2) = \binom{20}{4} (0.2)^{4} (0.8)^{16} = 0.2182 \]

\[ p(R_3 \mid S_3) = p(x = 6 \mid p = 0.3) = \binom{20}{6} (0.3)^{6} (0.7)^{14} = 0.1916 \]

\[ p(R_4 \mid S_4) = p(x = 6 \mid p = 0.3) = \binom{20}{6} (0.3)^{6} (0.7)^{14} = 0.1916 \]

\[ p(R_5 \mid S_5) = p(x = 1 \mid p = 0.1) = \binom{20}{1} (0.1)^{1} (0.9)^{19} = 0.2702 \]

The posterior probabilities are easily determined following Eq. 12.5 with
the computational procedure also stated in the table. As expected, the
energy reduction amounts are modified in light of the Bayesian updating.
For example, chance event \(S_3\) which had a prior of 0.3 now has a
posterior of 0.282, and so on. The revised average energy reduction
throughout the population, i.e., the entire neighborhood, is
\(\sum_{i=1}^{5} f_i \cdot p'_i = 0.0531\) or 5.31%, which reinforces the
fact that the target of 5% average energy reductions could be met. This is
the confirmation brought about by the information contained in the sample
of 20 households. It is obvious that the 20 households selected should
indeed be random and a representative sample of the entire neighborhood
(which is not as simple as it sounds; see Sect. 4.7 where the issue of
random sampling is discussed). Also, the larger the sample, the more
accurate the prediction results. However, there is a financial cost and a
time delay associated with the sampling phase. The concept of value of
perfect information, indicative of the maximum amount of funds that can be
spent to gather such information (see Sect. 12.2.8), can also be extended
to the problem at hand. Bayesian methods applied to decision-making
problems embody a vast amount of literature, and the interested reader can
refer to numerous advanced texts in this subject (for example, Gelman
et al. 2004). ■
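The updating steps of columns (d) and (e) of Table 12.8 can be traced with the short sketch below. It takes the conditional probabilities exactly as tabulated in column (c) rather than recomputing them, and the variable and function names are illustrative assumptions.

```python
# Bin mid-points of the energy reduction, prior probabilities, and the
# conditional probabilities as tabulated in column (c) of Table 12.8
ENERGY_REDUCTION = [0.01, 0.03, 0.05, 0.07, 0.09]
PRIOR = [0.10, 0.20, 0.30, 0.30, 0.10]
LIKELIHOOD = [0.1796, 0.2182, 0.1916, 0.1916, 0.2702]


def posterior(prior, likelihood):
    """Return the normalized posterior and the normalizing sum of column (d)."""
    joint = [p * l for p, l in zip(prior, likelihood)]  # column (d)
    total = sum(joint)
    return [j / total for j in joint], total            # column (e)


def expected_reduction(probabilities):
    """Probability-weighted average energy reduction."""
    return sum(f * p for f, p in zip(ENERGY_REDUCTION, probabilities))


if __name__ == "__main__":
    post, total = posterior(PRIOR, LIKELIHOOD)
    print([round(p, 3) for p in post], round(total, 5))
    print(round(expected_reduction(PRIOR), 4), round(expected_reduction(post), 4))
```

Replacing the tabulated likelihoods with values from a different sample is all that is needed to repeat the update for new survey data.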
12.3 Risk Analysis

12.3.1 Formal Treatment of Risk Analysis

Without uncertainty, there is no risk. Risk analysis is a quantitative
tool or process which allows for sounder decision-making by explicitly
treating risks associated with the issue at hand. Risk is a ubiquitous
aspect of everyday life. It has different connotations in everyday and
scientific contexts, but all deal with the potential effects of a loss
(financial, physical, ...) caused by an undesired event or hazard. This
section will describe the basic concepts of risk and risk analysis in
general, while a case study example is presented in Sect. 12.4.1.

The analysis of risk can be viewed as a more formal and 
scientific approach to the well-known Murphy's Law. It con- 
sists of three sub-activities elaborated below. Risk analysis 
provides a framework for determining the relative urgency of 
problems and the cost-effective allocation of limited resour- 
ces to reduce risk (Gerba 2006). Thus, preventive measures, 
remedial actions and control actions can be identified and 
targeted towards sources and situations which are most
critical. Risk analysis arose as a formal discipline in the
1940s, a development usually attributed to the rise of the
nuclear industry. It has subsequently been expanded to cover
numerous facets of our everyday life.

Though different sources categorize them a little diffe- 
rently, the formal treatment of risk analysis includes three 
specific and interlinked aspects (NRC 1983; Haimes 2004; 
USCG 2001): 

(a) risk assessment which involves three sub-activities: 
(i) hazard identification: identifying the sources and 
nature of the hazards (either natural or man-made), 
(ii) probability of occurrence: estimating the likeli- 
hood or frequency of their occurrence expressed as 
a probability. Recall that in Sect. 2.6, three kinds of 
probabilities were discussed: objective or absolute, 
relative, and subjective probabilities, 
(iii) evaluating the consequences (monetary, human 
life,. . .) were they to occur. This can be done by one 
of three approaches: Qualitative, which is based on 
common sense or tacit knowledge of experienced 
professionals, and wherein guidance is specific to 
measures which can/ought to be implemented wit- 
hout explicitly computing risk. Generally, this type of 
heuristic approach is extensively used during the ear- 
ly stages of a new threat (such as that associated with 
recent extraordinary incidents); Empirical, which is
based on some simple formulation of the risk function
that involves combining heuristic weights to some
broad measures characterizing the system; Quantitative,
which is based on adopting scientific and statistical
approaches that have the potential of providing
greater accuracy in applications where the hazards
are well-defined in their character, their probability
of occurrence and their consequences.
Several books have been written on the issue of quan- 
titative risk assessment, both in general terms (such as 
that by Haimes 2004) and for specific purposes (such 
as microbial by Haas et al. 1999). Quantitative risk as- 
sessment methods are tools based on accepted and stan- 
dardized mathematical models that rely on real data as 
their inputs. This information may come from a random 
sample, previously available data, or expert opinion. 
The basis of quantitative risk assessment is that it can 
be characterized as the probability of occurrence of an 
adverse event or hazard multiplied by its consequence. 
Since both these terms are inherently such that they 
cannot be quantified exactly, a major issue in quanti- 
tative risk assessment is how to simulate, and thereby 
determine, confidence levels of the uncertainty in the 
risk estimates. Very sophisticated probability based sta- 
tistical techniques have been proposed in the published 
literature involving traditional probability distributions 
in conjunction with Monte Carlo and bootstrap techni- 
ques as well as artificial intelligence methods such as 
fuzzy logic (Haas et al. 1999). 

There has been a certain amount of skepticism by po- 
licy and decision makers towards quantitative risk as- 
sessment models even when applied to relatively well- 
understood systems. The reasons for this lack of model 
credibility have been listed by Haimes (2004) and in- 
clude such causes as naive or unrealistic models, un- 
calibrated models, poorly skilled users, lack of multi- 
objective criteria, overemphasis on computer models as 
against tacit knowledge provided by skilled and expe- 
rienced practitioners. Nonetheless, the personal opinion 
of several experts is that these limitations should not be 
taken as a deterrent in developing such models, but be 
taken as issues to be diligently addressed and overcome 
in the future, 
(b) risk management which is the process of controlling 
risks, weighing alternatives and selecting the most ap- 
propriate action based on engineering, economic, legal 
or political issues. Risk management deals with how 
best to control or minimize the specific identified risks 
through remedial planning and implementation. These 
include (i) enhanced technical innovations intended to 
minimize the consequences of a mishap, and (ii) increa- 
sed training of concerned personnel in order to both re- 
duce the likelihood and the consequences of a mishap 
(USCG 2001). Good risk management and control
cannot prevent bad things from happening altogether,
but they can minimize both the probability of occurrence
and the consequences of a hazard. Risk management
includes risk resolution, which narrows the set
of remedial options (or alternatives) to the most promising
few by determining their risk leverage factor. This
measure of their relative cost-benefit is computed as
the difference in risk assessment estimates before and
after the implementation of the specific risk action plan
or measure, divided by its implementation cost.
Risk management also includes putting in place response 
and recovery measures. A major natural disaster occurs 
in the U.S. on an average of 10 times/year with minor 
disasters being much more frequent (AIA 1999). Once 
such disasters occur, the community needs to respond 
immediately and provide relief to those affected. Hence, 
rapid-response relief efforts and longer-term rebuilding 
assistance processes have to be well-thought out and in 
place beforehand. Such disaster response efforts are ty- 
pically coordinated by federal agencies such as the Fe- 
deral Emergency Management Agency (FEMA) along 
with national and local volunteer organizations. 
(c) risk communication which can be done on either a
long-term or a short-term basis, and involves informing the
concerned people (managers, stakeholders, officials, 
public,. . .) as to the results of the two previous aspects. 
For example, at a government agency level, the announ- 
cement of a potential terrorist threat can lead to the im- 
plementation of certain immediate mitigation measures 
such as increased surveillance, while on an individual 
level it can result in people altering their daily habits by, 
say, becoming more vigilant and/or buying life safety 
equipment and storing food rations. 
The above aspects, some have argued in recent years, are 
not separate events but are interlinked since measures from 
one aspect can affect the other two. For example, increased 
vigilance can deter potential terrorists and, thus, lower the 
probability of occurrence of such an event. As pointed out 
by Haimes (2004), risk analysis is viewed by some as a se- 
parate, independent and well-defined discipline as a whole. 
On the other hand, there are others who view this discipline 
as being a sub-set of systems engineering that involves (i) 
improving the decision-making process (involving planning, 
design, and operation), (ii) improving the understanding of 
how the system behaves and interacts with its environment, 
and (iii) incorporating risk analysis into the decision-ma- 
king process. The narrower view of risk analysis can, no- 
netheless, provide useful and relevant insights to a variety of 
problems. Consequently, its widespread appeal has resulted 
in it becoming a basic operational tool across the physical, 
engineering, biological, social, environmental, business and 
human sciences areas, which in turn has led to an exponen- 
tial demand for risk analysts in recent years (Kammen and 
Hassenzahl 1999). 
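
The quantitative view described above, in which risk is the probability of a hazard multiplied by its consequence and both terms are uncertain, can be sketched with a small Monte Carlo simulation. The sketch below is only illustrative: the uniform and triangular distributions, the function names and every number are assumptions made for the example and do not come from the references cited. It also computes the risk leverage factor as the reduction in risk divided by the implementation cost.

```python
import random
import statistics


def simulate_risk(p_low, p_high, c_low, c_mode, c_high, n=20_000, seed=0):
    """Return n Monte Carlo samples of risk = probability x consequence."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    samples = []
    for _ in range(n):
        p = rng.uniform(p_low, p_high)             # uncertain probability of the hazard
        c = rng.triangular(c_low, c_high, c_mode)  # uncertain consequence (e.g., in $)
        samples.append(p * c)
    return samples


def summarize(samples):
    """Mean and an empirical 90% interval of the simulated risk."""
    ordered = sorted(samples)
    n = len(ordered)
    return {
        "mean": statistics.fmean(ordered),
        "p05": ordered[int(0.05 * n)],
        "p95": ordered[int(0.95 * n)],
    }


def risk_leverage(risk_before, risk_after, cost):
    """Risk reduction achieved per unit of implementation cost."""
    return (risk_before - risk_after) / cost


# Hypothetical case: a measure costing 2,000 halves the hazard probability.
before = summarize(simulate_risk(0.01, 0.03, 1e5, 3e5, 1e6))
after = summarize(simulate_risk(0.005, 0.015, 1e5, 3e5, 1e6))
print(before)
print(after)
print(risk_leverage(before["mean"], after["mean"], cost=2000.0))
```

Ranking competing remedial options by their leverage factor is one way of narrowing them to the most promising few; the spread between the 5th and 95th percentiles indicates how much confidence can be placed in each estimate.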



12.3.2 Context of Statistical Hypothesis Testing

Statistical hypothesis testing involving Type I and Type II
errors has been discussed in Sect. 4.2.2. Recall that such
statistical decisions inherently contain probabilistic
elements. In other words, statistical tests of hypothesis do not
always yield conclusions with absolute certainty: they have
in-built margins of error, just as jury trials sometimes hand
down wrong verdicts. Hence, there is the need to distinguish
between two types of errors. Concluding that the null
hypothesis is false when in fact it is true was called a Type I
error, and represented the probability α (i.e., the pre-selected
significance level) of erroneously rejecting the null hypothesis.
This is also called the "false positive" or "false alarm" rate.
The flip side, i.e. concluding that the null hypothesis is true
when in fact it is false, was called a Type II error and
represented the probability β of erroneously accepting the null
hypothesis, also called the "false negative" rate.

The two types of error are inversely related. A decrease in
the probability of one type of error is likely to result in an
increase in the probability of the other. Unfortunately, one
cannot simultaneously reduce both by selecting a smaller value
of α: for a fixed sample size, lowering α raises β. The analyst
usually selects the significance level depending on the
tolerance for, or seriousness of the consequences of,
either type of error specific to the circumstance. Hence, this
selection process can be viewed as a part of risk analysis.
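
The inverse relation between the two error probabilities can be illustrated numerically. The following is a minimal sketch under assumptions not made in the text: a one-sided z-test with known standard deviation, and hypothetical values for the effect size and sample size. The function name `type_ii_error` is introduced only for this illustration.

```python
from statistics import NormalDist


def type_ii_error(alpha, effect, sigma, n):
    """Type II error probability (beta) of a one-sided z-test.

    H0: mu = mu0 versus H1: mu = mu0 + effect, with known sigma and
    sample size n (an assumed, simplified setting).
    """
    z = NormalDist()                       # standard normal distribution
    z_crit = z.inv_cdf(1.0 - alpha)        # rejection threshold for the test statistic
    shift = effect * n ** 0.5 / sigma      # mean of the test statistic under H1
    return z.cdf(z_crit - shift)           # probability of failing to reject H0


# Tightening alpha (fewer false alarms) raises beta (more missed detections).
for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(type_ii_error(alpha, effect=0.5, sigma=1.0, n=16), 3))
```

Under these assumptions the only way to lower both probabilities at once is to collect more data (a larger n), which is why the choice of significance level is a matter of weighing consequences.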

Example 12.3.1: Evaluation of tools meant for detecting 
and diagnosing faults in chillers 

Refer to Example 2.5.2 in Sect. 2.5.1 where the case of 
fault detection of chillers from performance data is illustra- 
ted by a tree diagram from which Type I and II errors can be 
analyzed. Let us consider an application where one needs to 
evaluate the performance of different automated fault detec- 
tion and diagnosis (FDD) methods or tools meant to conti- 
nuously monitor large chillers and assess their performance. 
Reddy (2007) proposed a small set of quantitative criteria in 
order to perform an impartial evaluation of such tools being 
developed by different researchers and companies. The eva- 
luation is based on formulating an objective function whe- 
re one minimizes the sum of the costs associated with false 
positives and those of missed opportunities. Consider the
FDD tool as one which, in operation, first sorts or flags
incoming system performance data into either fault-free or
faulty categories, with further subdivisions as described
below (Fig. 12.13):

(a) False negative rate denotes the probability of calling
a faulty process good, i.e. missed opportunity loss
(Type II error);

(b) Correct fault-free detection rate denotes the probability
of calling a good process good;

(c) False positive rate denotes the probability of calling a
good process faulty, i.e. false alarm (Type I error);

(d) Correct faulty detection rate denotes the probability of
calling a faulty process faulty.

Fig. 12.13 Evaluation procedure for detecting and diagnosing
faults while monitoring an engineering system. (From Reddy 2007)

Note that in case only one crisp fault detection threshold 
value is used, probabilities (a) and (d) should add to uni- 
ty, and so should (b) and (c). Thus, if the correct fault-free 
detection rate is signaled by the FDD tool as 95%, then the 
corresponding false alarm rate is 5%. Once a fault has been 
detected correctly, there are four possibilities, each of which 
has a cost implication (in terms of technician's time to deal 
with it — see Fig. 12.13): 

• Correct and unique diagnosis, where the fault is correctly 
and unambiguously identified; 

• Correct but non-unique, where the diagnosis rules are not 
able to distinguish between more than one possible fault; 

• Unable to diagnose, where the observed fault patterns do 
not correspond to any rule within the diagnosis rules; 

• Incorrect diagnosis, where the fault diagnosis is done im- 
properly. 

The evaluation of different FDD methods consisted of 
two distinct aspects: 

(a) how well is fault detection done? Given the rudimentary
state of practical implementation of FDD methods
in HVAC equipment, say chillers, it would suffice for
many service companies merely to know whether a
fault has occurred; they would then send a service
technician to diagnose and fix the problem; and

(b) how well does the FDD methodology perform overall?
Each of the four diagnosis outcomes will affect the time
needed for the service technician to confirm the suggested
diagnosis of the fault (or to diagnose it).

Varying the fault detection threshold affects the sensitivity
of detection, i.e., the Correct Fault-Free Detection Rate
and the False Positive Rate are affected in compensatory and
opposite ways. Since these rates have different cost
implications, one cannot simply optimize the total error rate.
Instead, FDD evaluation can be stated as a minimization problem
where one tries to minimize the sum of penalties associated
with false positives (which require a service technician to
needlessly respond to the service call) and with ignoring the
alarm meant to signal the onset of some fault (which could
result in an extra operating cost incurred due to excess energy
use and/or shortened equipment life). The frequency of
occurrence of different faults over a long enough period (say, 3
months of operation) needs to be explicitly considered, as well
as their energy cost penalties. Such considerations led to the
formulation of objective functions which were applicable to 
any HVAC&R system. Subsequently, these general expres- 
sions were then tailored to the evaluation of FDD methods 
and tools appropriate to large chillers, and specific numerical 
values of several of the quantities appearing in the normal- 
ized rating expression were proposed based on discussions 
with chiller manufacturers and service companies as well 
as analysis of fault-free and faulty chiller performance data 
gathered in a laboratory from a previous research study. This 
methodology was used to assess four chiller FDD methods 
using fault-free data and data from intentionally introduced 
faults of different types and severity collected from a
laboratory chiller (Reddy 2007).
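
The idea of choosing a detection threshold by minimizing the combined cost of false alarms and missed faults can be sketched as follows. This is not the objective function of Reddy (2007); it is a simplified stand-in in which a single residual is assumed to be normally distributed for both fault-free and faulty operation, and the fault frequency and the two cost penalties are hypothetical inputs.

```python
from statistics import NormalDist


def detection_rates(threshold, fault_free, faulty):
    """False positive and false negative rates for one crisp threshold."""
    fpr = 1.0 - fault_free.cdf(threshold)  # good process called faulty (Type I)
    fnr = faulty.cdf(threshold)            # faulty process called good (Type II)
    return fpr, fnr


def expected_cost(threshold, fault_free, faulty, p_fault,
                  cost_false_alarm, cost_missed):
    """Expected penalty per observation, weighted by how often faults occur."""
    fpr, fnr = detection_rates(threshold, fault_free, faulty)
    return (1.0 - p_fault) * fpr * cost_false_alarm + p_fault * fnr * cost_missed


def best_threshold(candidates, **kwargs):
    """Candidate threshold with the lowest expected cost."""
    return min(candidates, key=lambda t: expected_cost(t, **kwargs))


# Hypothetical residual distributions and penalties (illustrative values only).
settings = dict(
    fault_free=NormalDist(0.0, 1.0),
    faulty=NormalDist(2.0, 1.0),
    p_fault=0.1,
    cost_false_alarm=200.0,   # needless service call
    cost_missed=1500.0,       # excess energy use, shortened equipment life
)
grid = [i / 100 for i in range(-100, 401)]
print(best_threshold(grid, **settings))
```

Because the two penalties differ, the cost-minimizing threshold generally does not coincide with the threshold that minimizes the total error rate, which is the point made in the text.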



12.3.3 Context of Environmental Risk to Humans

One of the most important shifts in environmental energy po- 
licy was the acceptance in the late 1980s of the role of risk 
assessment and risk management in environmental decision- 
making (Masters and Ela 2008). One approach to quantify- 
ing such environmental risks (as against accidents where the 
causality is direct and obvious) is in terms of premature mor- 
tality or death. Everyone will eventually die, but being able 
to identify specific causes which are likely to decrease a per- 
son's normal life span is important. The government can then 
take decisions on whether the mitigation or cost of reducing 
that risk level is warranted in terms of its incremental benefit 
to society as a whole. Rather than focusing on one source of 
risk and regulating it to oblivion, the policy approach adopted 
is to regulate risks to a comparable level of consequence. This 
approach recognizes that there is a tradeoff between the uti- 
lity gain due to increased life expectancy and the utility loss 
due to decreased consumption (Rabl 2005), and so a rational 
policy should seek to optimize this tradeoff, i.e., maximize 
the total life-time utility. The general scientific framework 
or methodology for evaluating risks in a quantitative manner 
is well established, with the issue of how best to apply/tailor 
it to the context or specific circumstance still being in vary- 
ing stages of maturity. An important limitation is the lack of 
complete quantitative data needed to exercise the assessment 
models, along with the realization that such data may never 
be entirely forthcoming in many application areas. 

The occurrence, and then avoidance, of environmental 
risks are not deterministic events, and have to be considered 
in the context of probability. In Sect. 2.6, the notion of re- 
lative probabilities was described, and Table 2.10 provided 
values of relative risks of various causes of deaths in the 
U.S. However, such statistics provide little insight in terms 
of shaping a policy or in defining a mandatory mitigation 
measure. For example, 24% die of cancer, but what specifi- 
cally led to this cancer in the first place cannot be surmised 
from Table 2.10. For risk assessment to be useful practically, 
one has to consider the root causes of these cancers and, if 
possible, reduce their impact on just those individuals who 
are most at risk. Hence, environmental risks are stated as in- 
cremental probabilities of death to the exposed population 
(referred to as statistical deaths) which is much more mea- 
ningful. The USEPA and the Food and Drug Administration 
(FDA) restrict the amount of chemicals or toxins a person 
can be exposed to, based on some scientific studies reflecti- 
ve of the best current knowledge available while presuming 
an incremental probability range. USEPA has selected toxic
exposure levels which pose incremental lifetime cancer risks
to exposed members of the public in the range of $10^{-6}$ to
$10^{-4}$, i.e., between one extra death due to cancer per million
and one per 10,000 exposed individuals. If one considers the
lower risk level, and assuming the US population to be 300
million, one can expect 300 additional deaths, which spread
over a typical 70 year life expectancy would translate to about
four extra deaths per year. Table 12.9 assembles activities that
increase mortality risk by one in a million, the level assumed
for acceptable environmental risk. Risks from various activities
such as smoking, living in major cities and flying are shown;
these should be taken as estimates and may be contested or
even modified over time. Note that this is a dated study (30
years or more), and many of the numbers in the table may no
longer be realistic. Conditions change over time (for example,
living in New York City or Boston has much less adverse
air pollution risk these days), and periodic revision of these
risks is warranted.

Table 12.9 Activities that increase risk by one in a million. (From Wilson 1979)

Activity                                               Type of risk
Smoking 1.4 cigarettes                                 Cancer, heart disease
Drinking 1/2 l of wine                                 Cirrhosis of the liver
Spending 1 h in a coal mine                            Black lung disease
Living 2 days in New York or Boston                    Air pollution
Traveling 300 miles by car                             Accident
Flying 1,000 miles by jet                              Accident
Traveling 10 miles by bicycle                          Accident
Traveling 6 min in a canoe                             Accident
Living 2 summer months in Denver (vs sea level)        Cancer by cosmic radiation
Living 2 months with a cigarette smoker                Cancer, heart disease
Eating 40 tablespoons of peanut butter                 Liver cancer
Living 50 years within 5 miles of a nuclear reactor    Accidental radiation release

Dose response models were introduced earlier in 
Sect. 1.3.3 and treated more mathematically in Sects. 10.4.4 
and 11.3.4. The linear slope of the dose-response curve of a 
particular toxin is called the potency factor (PF), and is defi- 
ned as (Masters and Ela 2008): 

$$\text{Potency factor} = \frac{\text{Incremental lifetime cancer risk}}{\text{Chronic daily intake (mg/kg-day)}} \qquad (12.6)$$

where chronic daily intake (CDI) is the dose averaged over 
an entire lifetime per unit kg of body weight. Rearranging 
the above equation: 

$$\text{Incremental lifetime cancer risk} = \text{CDI} \times \text{PF} \qquad (12.7)$$

The USEPA maintains a database on toxic chemicals called
the Integrated Risk Information System (IRIS) where
information on PF can be found. Table 12.10 assembles
toxicity data of selected carcinogens. Note that the statistics
in Table 2.10 are based on actuarial data which may be taken
to be accurate. However, the data in Table 12.10 are based
on toxicological studies, usually with some assumptions on
model structure and parameter values. In that sense, they are
estimates which may contain large variability because of the
uncertainties inherent in the extrapolations: (i) from lab tests
on animals to human exposure, and (ii) from high dosage
levels to the much lower dosage levels to which humans are
usually exposed.

Table 12.10 Toxicity data for selected carcinogens. (USEPA website, www.epa.gov/iris)

Chemical            Potency factor (mg/kg-day)^-1
                    Oral route        Inhalation route
Arsenic             1.75              50
Benzene             2.9 x 10^-2       2.9 x 10^-2
Chloroform          6.1 x 10^-3       8.1 x 10^-2
DDT                 0.34              -
Methyl chloride     7.5 x 10^-3       1.4 x 10^-2
Vinyl chloride      2.3               0.295

Example 12.3.2:¹ Risk assessment of chloroform in drinking
water

Drinking water is often disinfected with chlorine. Unfortunately,
an undesirable byproduct is chloroform, which may
be carcinogenic. Suppose a person weighing 70 kg drinks 2 l
of water with a chloroform concentration of 0.10 mg/l (the
water standard) every day for 70 years.

(a) What is the risk to this individual?

From Table 12.10, the potency factor PF for the oral route =
$6.1 \times 10^{-3}$ (mg/kg-day)$^{-1}$.

The chronic daily intake CDI = (0.10 mg/l) · (2 l/day)/70 kg
= 0.00286 mg/kg-day

Thus, the incremental lifetime risk = CDI × PF =
$0.00286 \times 6.1 \times 10^{-3} = 17.4 \times 10^{-6}$, i.e., the extra
risk over a 70 year period is only about 17 in one million.

(b) How many extra cancers can be expected per year in a
town with a population of 100,000 if each individual is
exposed to the above risk?

Expected number of extra cancers per year

$$= (100{,}000) \cdot \frac{17.4}{10^{6}} \cdot \frac{1}{70} \approx 0.025 \ \text{cancers/year}$$

Clearly this number is much smaller than the total number
of deaths by cancer of all sorts, and hence, is lost in the
"background noise". This example serves to illustrate the
statement made earlier that it is difficult, if not impossible,
to attribute an individual death to a specific environmental
effect. ■
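
The arithmetic of Example 12.3.2 follows directly from Eq. 12.7 and can be checked with a few lines of code. The function names below are introduced only for this sketch; the numerical inputs are those stated in the example, and the exposure duration is taken equal to the 70 year averaging time so that no duration correction is needed.

```python
def chronic_daily_intake(conc_mg_per_l, intake_l_per_day, body_mass_kg):
    """CDI in mg/kg-day for lifelong daily exposure through drinking water."""
    return conc_mg_per_l * intake_l_per_day / body_mass_kg


def incremental_lifetime_risk(cdi, potency_factor):
    """Eq. 12.7: incremental lifetime cancer risk = CDI x PF."""
    return cdi * potency_factor


def extra_cancers_per_year(risk, population, lifetime_years=70):
    """Expected extra cancers per year in a uniformly exposed population."""
    return risk * population / lifetime_years


cdi = chronic_daily_intake(0.10, 2.0, 70.0)      # about 0.00286 mg/kg-day
risk = incremental_lifetime_risk(cdi, 6.1e-3)    # about 17.4e-6
per_year = extra_cancers_per_year(risk, 100_000) # about 0.025 cancers/year
print(cdi, risk, per_year)
```

Substituting the inhalation potency factor, or a different body mass or intake, shows how sensitive the risk estimate is to the assumed exposure scenario.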

¹ From Masters and Ela (2008), by permission of Pearson Education.

Another problem with environmental risk assessment is
that health and safety are moral sentiments (like freedom,
peace and happiness), and are not absolutes which can be
quantified impartially. They are measured intangibly by the
absence of their undesirable consequences, which are also
difficult to quantify (Heinsohn and Cimbala 2003). This
leads to much controversy even today among various
stakeholders about environmental risk assessment in general:
skepticism from scientists about the ambiguous and uncertain
dose-response relationships used, certain segments of the
population expressing overblown concerns, and other
segments feeling that risks are over-stated.

Yet another issue is that risks have to be distinguished
by whether they are voluntary or involuntary. Recreational 
activities such as skiing or riding a motorcycle, or lifestyle 
choices such as smoking, can lead to greatly increased risks 
to the individuals, but these are choices which the individual 
makes voluntarily. The government can take certain partial 
mitigation measures such as mandating helmets or increa- 
sing the tax on cigarettes. However, there are some volunta- 
ry risks (such as living in New York City, for example — see 
Table 12.9) which the government can do little about. Invo- 
luntary risks are those which are imposed on individuals be- 
cause of circumstances beyond their control (such as health 
ailments due to air pollution), and people tend to view these 
in a more adversarial manner. 

A final issue is risk perception which greatly influences 
how an individual views the threat, and this falls under sub- 
jective probability (discussed in Sect. 2.6). Figure 2.31 cle- 
arly demonstrates the wide disparity in how different profes- 
sionals view the adverse impact of global warming on gross 
world product. At an individual level, the perception may be 
greatly influenced by either the lack of control on the event 
or the uncertainty surrounding the issue or by excessive me- 
dia propaganda. For example, natural risks such as earthqua- 
kes or floods are accepted more readily (even though they 
cause much more harm) than man-made ones (such as plane 
crashes). Further, the uncertainty surrounding risks of radia- 
tion exposure to people living close to nuclear power plants 
or high-tension electric transmission lines has much to do 
with the current overblown concerns due to lack (or percei- 
ved lack) of scientific knowledge. When the bearers of risk 
do not share the cost of reducing the risk, extravagant reme- 
dies are apt to be demanded. 



12.3.4 Other Areas of Application 

This section provides brief literature reviews on the
application of risk analysis to different disciplines.
(a) Engineering 

DeGaspari (2002) describes past and ongoing activities by 
the American Society of Mechanical Engineers (ASME) on 
managing industrial risk, and quotes experts as stating that: 
(i) risk analysis with financial tools can benefit a company's 
bottom line and contribute to safety, (ii) a full quantitative
analysis can cost 10 times as much as a qualitative analy- 
sis, and (iii) fully quantitative risk analysis provides the best 
bet for optimizing plant performance and corporate values 
for the inspection/maintenance investment while addressing 
safety concerns. Lancaster (2000) investigates the major ac- 
cidents in the history of engineering and gives reasons why 
they occurred. The book gives many statistics for different 
types of hazards and cost for each type of disaster. Additio- 
nally, chapters on human error are also included. 

There is extensive literature on risk analysis as applied to 
nuclear power plants, nuclear waste management and trans- 
portation, as well as more mundane applications in mecha- 
nical engineering. A form of risk analysis that is commonly 
used in the engineering field is reliability analysis. This par- 
ticular analysis approach is associated with the probability 
distribution of the time a component or machine will operate 
before failing (Vose 1996). Reliability has been extensively 
used in mechanical and power engineering in general, and in 
the field of machine design in particular. It is especially use- 
ful in modeling the likelihood of a single component of the 
machine failing, and then deducing the failure risk of several 
components placed in series or parallel. Reliability can be 
viewed as the risk analysis of a system due to mechanical or 
electrical failures, whereas traditional risk analysis in other 
areas deals with broader scenarios. 
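
How component reliabilities combine in series and in parallel can be sketched briefly. The assumptions here are not stated in the text: component failures are taken as statistically independent, and a single component is given a constant failure rate (exponential time to failure). The function names and numbers are illustrative only.

```python
from math import exp, prod


def exponential_reliability(failure_rate, t):
    """Probability that a component with constant failure rate survives to time t."""
    return exp(-failure_rate * t)


def series_reliability(reliabilities):
    """A series system works only if every component works."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """A parallel (redundant) system fails only if every component fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical pump with a failure rate of 0.0001 per hour, run for 1,000 h.
r = exponential_reliability(1e-4, 1000.0)
print(r)
print(series_reliability([r, r]))     # two pumps that must both work
print(parallel_reliability([r, r]))   # two pumps, either one is sufficient
```

The failure risk of the system is simply one minus its reliability, so redundancy lowers risk while adding components in series raises it.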

Risk analysis has also been adopted in several other 
fields, for example, during building construction (McDowell 
and Lemer 1991). In this application, risk analysis deals with 
cost and schedule: 

(i) Cost risk analysis is modeled as a discrete possible 
event, where the cost of the building is compared to a pay- 
back period, (ii) Schedule risk analysis deals with the con- 
nection between tasks that influence the construction time. 
Often, penalties must be paid if a building is not completed 
within the stipulated time period.
(b) Business 

Risk analysis has found extensive applications in the busi- 
ness realm, where several solutions may be posed, but only 
one is the best possible scenario since competing solutions 
always involve risk. In a marketing application, a sample can 
be taken from a random population, and through risk analy- 
sis and modeling, a marketing campaign can be designed. 
From a marketing standpoint, a company can identify the 
kind of campaign the public best responds to, and alter their 
marketing accordingly. Risk assessment techniques are also 
commonly employed in the business realm in order to help 
make important decisions such as whether to invest in a ven- 
ture, or where to optimally site a factory or business. Such 
techniques are often rooted in financial modeling, where the 
risk is directly related to the monetary pay-off in the end. 
There are four major categories of decisions in the business
world that utilize risk assessment (Evans and Olson 2000):
(i) acceptance or rejection of a proposal based on either
net present value or internal rate of return; (ii) selection of
the best choice among mutually exclusive alternatives; for
example, selecting a fuel source among wood, oil, or natural
gas would depend on several factors such as price,
availability, and growth rate; (iii) selection of the best choice
among non-mutually exclusive alternatives; (iv) decisions
containing a degree of uncertainty, which involve calculating the
expected opportunity loss and return-to-risk ratios by creating
decision trees and using Monte Carlo simulation techniques.
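
The expected opportunity loss mentioned above can be illustrated with a small payoff table. The sketch assumes a discrete set of actions and states of nature with known probabilities; the payoff values, the state probabilities and the function name are hypothetical and serve only to show the calculation.

```python
def expected_opportunity_loss(payoffs, probs):
    """Expected opportunity loss (regret) of each action.

    payoffs maps an action name to its list of payoffs, one per state;
    probs lists the probability of each state in the same order.
    """
    n_states = len(probs)
    # Best achievable payoff in each state, had that state been known in advance.
    best = [max(row[s] for row in payoffs.values()) for s in range(n_states)]
    return {
        action: sum(p * (best[s] - row[s]) for s, p in enumerate(probs))
        for action, row in payoffs.items()
    }


# Hypothetical venture: payoffs under a strong and a weak market.
payoffs = {"invest": [100.0, -40.0], "hold": [0.0, 0.0]}
probs = [0.6, 0.4]
eol = expected_opportunity_loss(payoffs, probs)
print(eol)
print(min(eol, key=eol.get))  # action with the smallest expected regret
```

The smallest expected opportunity loss equals the expected value of perfect information for the same payoff table, which ties this calculation back to the decision analysis concepts of Sect. 12.2.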

(c) Human Health, Epidemiology & Microbial 
Risk Assessment 

Risk assessment is also commonly used when human health 
concerns are a factor. Haas et al. (1999) outline the primary 
areas where risk assessment is applied in health situations, 
and give a process for performing a risk assessment. This 
area of study is concerned with the impact of exposure to 
defined hazards on human health. Epidemiology, which is 
a subset of human health, is the "study of the occurrence 
and distribution of disease and associated injury specified by 
person, place, and time" (Haas et al. 1999), while microbial 
assessment is concerned only with the disease and its oppor- 
tunity to spread. 

The risk assessment process in health situations consists 
of four steps: (i) Hazard Identification which describes the 
health effects that are the result of human exposure to any 
type of hazard; (ii) Dose-Response Assessment which corre- 
lates the amount of time of the exposure to the rate of inciden- 
ce of infection or sickness; (iii) Exposure Assessment which 
determines the size and nature of the population that was 
exposed to the hazard, and also how the exposure occurred, 
the amount, and the total elapsed time of exposure; and (iv) 
Risk Characterization which integrates the information from 
the above steps to calculate the implications for the general 
public's health, and calculate the variability and uncertainty 
in the assessment. A book chapter by Gerba (2006) deals spe- 
cifically with health-based and ecological risk assessment. 

(d) Extraordinary Events 

Several agencies such as EPA, OSHA, NIOSH, ACGIH and
ANSI have developed quantitative metrics such as PEL, TLV,
MAC, REL (see for example, Heinsohn and Cimbala 2003)
which specify the threshold or permissible levels for
short-term and long-term exposure. The bases of these metrics are
similar in nature, but these metrics differ in their threshold
values because they target slightly different populations. For
example, NIOSH is more concerned with the workplace
environment (where the work force is generally healthier),
while EPA's concern is with the general public (that includes the
young and elderly as well, and under longer exposure times).
NIOSH has developed TLV-C (threshold limit values-ceiling)
guidelines for maximum chemical concentration
levels that should never be exceeded in the workplace. EPA
defines extreme events as once-in-a-lifetime, non-repetitive
or rare exposure to airborne contaminants for not more than
8 h. Though only "chemicals" are mentioned, the definition
could be extended to apply to biological and radiological
events as well. Consequently, EPA formed a Federal Adviso- 
ry Committee with the objective of developing scientifically 
valid guidelines called AEGLs (Acute Exposure Guideline 
Levels) to help national and local authorities as well as pri- 
vate companies deal with emergency planning, prevention 
and response programs involving chemical releases. The 
AEGL program further defines 3 levels (Level 1 thresholds 
for no adverse effects; Level 2 thresholds with some adver- 
se effects; and Level 3 with some deaths) and also defines 
acute exposure levels at different exposure times: 10 min, 
30 min, 60 min, 4 h and 8 h. Another parallel effort is aimed 
at developing a software program called CAMEO
(Computer-Aided Management of Emergency Operations) to help in
the planning and response to chemical emergencies. It is to 
be noted that exposure levels in CAMEO are different from 
AEGLs, and do not include exposure duration. 

Recent world events (such as 9/11) have spurred lea- 
ding American engineering societies (such as IEEE, ASCE, 
ASME, ASHRAE,...) as well as several federal and state 
agencies to form expert working groups with the mission to 
review all aspects of risk analysis as they apply to critical 
infrastructure systems under extreme man-made events. The 
journal of Risk Analysis devoted an entire special issue (vol. 
22, no 4., 2002) with several scholarly articles on the role of 
risk analysis applied to this general problem, on government 
safety decisions, on how to use risk analysis as a tool for
reducing risk of terrorism, and on the role of risk analysis and
management. Studies of a more pragmatic nature have been 
developed by several organizations dealing with applying 
risk analysis and management methodologies and software 
specific to extreme event risk to critical infrastructure. Seve- 
ral private, state and federal entities and organizations have 
prepared numerous risk assessment guidance documents. 
ASTM and ASCE have developed guides specific to natur- 
ally occurring extreme events applicable to buildings and 
indoor occupants from a structural viewpoint. Sandia Natio- 
nal Laboratory has numerous risk assessment methodology 
(RAM) programs for both terrorist and natural events; one 
well-known example is the RAMPART software tool (Hunter
2001), which performs property analysis and ranking for
General Services Administration (GSA) buildings due to both
natural and man-made risks. It allows assessing risks due to 
terrorism, natural disaster and crime in federal buildings na- 
tionwide by drawing on a database of historic risks for dif- 
ferent disasters in different regions in the U.S. It can also be 
adapted to other types of critical facilities such as embassies, 
school systems and large municipalities. 

There has also been a flood of documents on risk analysis
and mitigation related to extreme events in terms of indoor
air quality (IAQ) risks to building occupants (reviewed in a
report by Bahnfleth et al. 2008). The current thinking among
fire hazard assessment professionals (which is perhaps the
most closely related field to extreme event IAQ) is that, since
the data needs for the risk evaluation model are unlikely to
be known with any confidence, it is better to adopt a relative
risk approach, where the building is evaluated based on a
pre-selected set of extreme event scenarios, rather than
an absolute one (Bukowski 2006). Several semi-empirical
methods to quantify the relative risk of building occupants 
to the hazard of a chemical-biological-radiological (CBR) 
attack have been proposed. For example, Kowalski (2002) 
outlines a method which involves considering four separate 
issues (hazard level, number of occupants, building profile 
and vulnerability), assigning weights between 0-100% for 
each of them, and deducing an overall weighted measure 
of relative risk for the entire building. Appendix C of the 
ASHRAE guidance document (ASHRAE 2003) describes 
a multi-step process involving defining a building category 
(based on factors such as number of occupants, number of 
threats received, time to restore operation, monetary value of 
the building. . .), assigning relative weights of occupant expo- 
sure, assigning point values for severity, determining sever- 
ity level in each exposure category, and finally calculating 
an overall score or rank from which different risk reduction 
measures can be investigated if a critical threshold were to be 
exceeded. Example 12.2.7 illustrated a specific facet of the 
overall methodology. 



12.4 Case Study Examples

12.4.1 Risk Assessment of Existing Buildings

This section presents a methodology for assessing risk in
existing large office buildings (based on a paper by Reddy and
Fierko 2004). The conceptual quantitative
model is consistent with current financial practices, and is
meant to provide guidance on identifying the specific risks
that need to be managed most critically in the building under
consideration. This involves identifying and quantifying the
various types and categories of hazards in typical buildings,
and proposing means to deal with their associated
uncertainties. The methodology also explicitly identifies the
vulnerabilities or targets of these hazards (such as occupant safety,
civil and operating costs, physical damage to a building and
its contents, and failure of one or several of the major building
systems), and considers the subtler fact that different
stakeholders of the building may differ in their perception of
the importance of these vulnerabilities to their businesses.
Finally, the consequences of the occurrence of these risks
are quantified in terms of financial costs consistent with the
current business practice of insuring a building and its
occupants. The emphasis in this example is on presenting the
conceptual methodology rather than accurate quantification
of the risks as they pertain to an actual building.

Similar to a business portfolio manager advising a client 
on how to allocate funds for retirement savings based on the 
client's age and risk tolerance, the person analyzing risk in a 
building must first consider the type of stakeholder (say, the 
owner or the tenant in a leased building scenario). The owner 
may be more concerned with the risk to the civil construc- 
tion and to the basic amenities which are his responsibilities, 
whereas the tenant may be more concerned with the cost of 
replacing the business equipment along with lost revenue 
should a deleterious event occur. Finally, both may be liable 
to be sued by the occupants if they were to be harmed. Hence, 
one needs to start with the stakeholder. 

There are several stakeholders in a building. They include 
anyone who has an interest in the design, construction, finan- 
cing, insurance, occupancy, or operation of a building. This 
list includes, but is not limited to, the following groups of
people: (i) building owner/landlord, (ii) building occupants/ 
tenants, (iii) architects/engineers, (iv) local code officials, 
(v) township/city representatives, (vi) neighbors, (vii) con- 
tractors/union officials, (viii) insurance companies, and (ix) 
banks/financing institutions. Let us consider only the first 
group: building owner. This example focuses on the risks as- 
sociated with an occupied leased commercial building, and 
not with the design, code compliance, and construction of 
such a structure. 

The building owner or landlord is defined as the person 
who finances and operates the building on a daily basis. This 
person is concerned more with the physical building and the 
financial impacts on the operation of such a structure. The 
concerns of the building owner lie in the areas of operating 
costs and the physical building. The building owner is assumed
not to be a regular occupant of the building; rather, this role
is performed from a remote location. On the other hand,
the occupants of the building are defined as the people who 
work in the building on a daily basis. A typical occupant is 
a company that leases space from the building owner.
Although the occupants are concerned with the physical building
to some degree, this group is much more concerned with
the well-being of its employees and the company-owned
contents inside the building. This group is also less
sensitive to financial impacts on the operation of the building^
since it is assumed that their lease rate is not sensitively tied
to fluctuations in building operations cost. Additionally, the
tenant can be viewed as a part-time occupant who has the
option of leaving the building after the lease has expired. The
situation where the occupant is also the owner of the building
is not considered in this risk assessment study, although
the proposed model can be altered to fit these conditions.



Table 12.11 Description of different hazard categories which impact
specific building elements

Building element    Hazard category  Description
------------------  ---------------  -----------------------------------------
Civil               Natural          Natural events that affect the civil
                                     construction of a building such as
                                     earthquakes, floods, and storms
                    Intentional      Actions that are purposely committed and
                                     designed to harm the physical building
                                     such as bombings and arson
                    Accidental       Actions that are not committed
                                     intentionally but have serious results
                                     such as unintentional fires and accidents
Direct Physical     Crime            Actions that only affect the occupants
                                     and not the physical building, such as
                                     robbery or assault
                    Terrorist        An act of terrorism that is intended to
                                     affect the occupants only, such as
                                     hostage situations
                    Bio & IAQ        Contamination of air in order to harm
                                     building occupants
Cybernetic          Intentional      Sabotage, hacking in IT networks
                    Accidental       Not committed on purpose, but which
                                     result in harm, such as computer crashes
MEP Systems         Accidental       The failure of mechanical, electrical or
                                     plumbing systems as well as telecom, fire
                                     safety equipment
Operation Services  Unanticipated    Impact of the fluctuations of utility
                                     prices and operation and maintenance that
                                     are required to operate the building



^ The building occupants are not totally insensitive to operating costs 
since they would generally pay for the utilities. 



The methodology consists of the following aspects: 

(a) Hazard Categories and Affected Building Elements 

The building elements that can be damaged should different
hazards occur need to be identified. The five building
elements that are susceptible are shown in Table 12.11 along
with their hazard categories.

(b) Vulnerable Targets in a Building 

Targets are those which are affected by different building
hazards, namely occupants, property replacement and revenue
loss. Table 12.12 lists these along with sub-targets.

(c) Applicability Matrix 

The hazards identified have an impact on the stakeholders 
listed previously in different ways. One set of stakeholders 
may be more sensitive to a certain risk event than to others. 
The risk analysis methodology explicitly considers this 
linkage between targets and sub-targets to affected building 
elements and associated hazard categories. This mapping is 
done by defining an applicability matrix which depends on 
the type of stakeholder for whom the risk analysis is being 
performed. The applicability matrix is binary in nature (i.e.,
numerical values can be 0 or 1, with 0 implying "not
applicable", and 1 implying "applicable"). Table 12.12 depicts
such an applicability matrix (AM) from the perspective of
the owner of a leased commercial building. For example, the
building owner may not view revenue loss vulnerabilities
due to lost business or cost of utilities to be his responsibility.
Hence, the corresponding cells have been assigned a value of
zero in most cases.

Table 12.12 Applicability Matrix (AM) from the perspective of the owner of a leased building

Columns (building element: hazard categories)
  Civil structure (building specific): Natural, Intentional, Accidental
  Direct physical (individual): Crime, Terrorist, Bio & IAQ
  Cybernetic: Intentional, Accidental
  MEP systems: Accidental
  Operation services: Unanticipated

Rows (vulnerable target: sub-targets)
  Occupants: Short-term, Long-term
  Property replacement: Physical building, Contents, Indoor environment, Building systems
  Revenue loss: Operating cost, Cost of utilities, Lost business
(d) Importance Matrices 

One approach to modeling the importance with which a
particular stakeholder views a specific target type or
sub-target is the continuous utility function approach described in
Sect. 12.2.5. Another is to do so using fuzzy theory (see, for
example, McNeill and Thro 1994). Fuzzy numbers provided
by the stakeholder are used along with their uncertainty,
characterized by a symmetrical triangular membership function.
The importance matrix between stakeholder and target
type (IM1) and that between target type and sub-target
(IM2) are shown in Fig. 12.14. From a building owner
perspective, property replacement is more crucial than, say,
occupant hazards, and the fuzzy values of 0.6 and 0.1 for IM1
reflect this perception. The tenant is likely to view the
relative importance of these two targets differently, which
was the reason for our initial insistence that one should start
first and foremost with the concerned stakeholder. Further,
the building owner is likely to be more concerned with the
short-term than with the long-term exposure of occupants (since
occupants change and it is more difficult to prove the owner's
culpability), which is translated into fuzzy values of 0.9 and
0.1 respectively in Fig. 12.14 corresponding to IM2. The
numerical values have no factual or historic basis nor have they
been deduced from surveys of a specific stakeholder class.
They are merely illustrative numbers for the purposes of this
example. Note that these fuzzy numbers are conditional, i.e.,
they add up to unity at each level. The membership function
is characterized by only one number (representing a plus/minus
range around the estimate) since a symmetric triangular
function is assumed in order to keep the data gathering as
simple as possible. The actual input fuzzy data needs to be
refined over time and be flexible enough to reflect, first, the
actual perception of a class of stakeholder, and second, that
of a specific individual stakeholder depending on his
preferences, circumstances, and special concerns.

Fig. 12.14 Overall Risk Assessment Tree Diagram by Target Category
with numerical values of the Importance Matrix (IM1) between
stakeholder (in this case, the building owner) and target types, and of
Importance Matrix (IM2) between targets and sub-target types. The numbers
are fuzzy values (which sum to unity) with associated uncertainty in
parentheses (assuming symmetric triangular membership)
(e) Hazard Event Probabilities 

Hazard categories proposed have been described above, and
also summarized in Table 12.12. One needs finer granularity
in the hazard categories by defining hazard events since each
of them needs different risk management and alleviation
measures. The various event categories assumed are shown in
Table 12.13. Note from Fig. 12.15 that natural civil hazards
can result from five different events (hurricane, earthquake,
tornado, flood and windstorm). The list of events is considered
to be exhaustive, a necessary condition for framing all
chance events (as stated in Sect. 12.2.3). This categorization
should be adequate to illustrate the risk assessment
methodology proposed. It is relatively easy to add or remove
specific hazard events, or even regroup some of them, which adds
to the flexibility of the proposed methodology.

Table 12.13 Event probabilities p_ij (absolute probability of occurrence
per year) for different hazard events and associated costs

Affected building   Hazard         Hazard event       Probability  Associated cost
element             category (i)   (j)                p_ij         ($/year)
------------------  -------------  -----------------  -----------  ---------------
Civil structure     Natural        Hurricane          0.005        I_d (building)
                                   Earthquake         0.0005
                                   Tornado            0.001
                                   Flood              0.01
                                   Winter storm       0.0005
                    Intentional    Arson              0.005
                                   Bombing            0.002
                                   Terrorism          0.003
                    Accidental     Fire               0.005
                                   Others             0.001
Direct physical     Crime          Robbery            0.01         I_d (occupant)
                                   Assault            0.01
                                   Homicide           0.005
                                   Rape               0.005
                    Terrorist      Hostage            0.008
                                   Hijacking          0.005
                                   Murder             0.003
                    Bio & IAQ      Intentional        0.02
                                   Accidental         0.01
                                   Sick building      0.03
Cybernetic          Intentional    Hacking/outside    0.01         C_Cyb
                                   Hacking/inside     0.02
                                   Industrial         0.02
                                   sabotage
                    Accidental     Crash              0.01
                                   Power outage       0.08
                                   Power surge        0.07
MEP Systems         Accidental     HVAC/Plumbing      0.003        C_MEP
                                   Electrical         0.002
                                   Telecom            0.001
                                   Security           0.002
                                   Fire alarm         0.001
                                   BMS                0.002
Operation Services  Unanticipated  Increase in:                    C_O&M
                                   Fuel price         0.01
                                   Elec. price        0.008
                                   Utility cost       0.005
                                   Labor cost         0.007

Each event has an uncertainty associated with it. This is best
represented by the hazard event probability, i.e., the absolute
annual probability of occurrence of that hazard event (shown in
Table 12.13; contrary to conditional probabilities, these will
not sum to unity), together with a distribution to characterize
its uncertainty. The absolute probabilities
assigned to specific hazard events will depend on such
considerations as climatic and geographic location of the city,
location of building within the city, importance and type of
building. These could be obtained through the research of
historical records, as was done for the RAMPART database
(Hunter 2001). In order to keep the assessment methodology
simple, the variability associated with these event
probabilities has been overlooked. Monte Carlo analysis could be
used to treat such variability, as illustrated in Sect. 12.2.7.

Fig. 12.15 Tree Diagram for various hazards affecting the civil
element from the perspective of building owner (numerical values of
absolute probabilities correspond to the case study)
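As a small illustration of how the absolute annual probabilities can be organized, the sketch below stores the civil-element values of Table 12.13 in a nested dictionary and sums them by hazard category. The dictionary layout and function name are illustrative assumptions, not part of the original study. Because these are absolute rather than conditional probabilities, the category totals are not expected to equal one.

```python
# Absolute annual probabilities for the civil building element (Table 12.13).
# The nested-dictionary layout and all names are illustrative assumptions.
CIVIL_EVENTS = {
    "Natural": {"Hurricane": 0.005, "Earthquake": 0.0005, "Tornado": 0.001,
                "Flood": 0.01, "Winter storm": 0.0005},
    "Intentional": {"Arson": 0.005, "Bombing": 0.002, "Terrorism": 0.003},
    "Accidental": {"Fire": 0.005, "Others": 0.001},
}


def category_totals(events):
    """Sum the absolute event probabilities within each hazard category."""
    return {category: sum(probs.values()) for category, probs in events.items()}


totals = category_totals(CIVIL_EVENTS)
for category, total in totals.items():
    print(f"{category}: {total:.4f} per year")
```

For rare events, such a sum is only a first-order indication of the chance that at least one event of the category occurs in a year.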
(f) Hazard Event Costs 

The consequences, or cost implications, of the occurrence of
different hazard events from the perspective of the
stakeholder (in this case, the building owner) need to be determined
in order to complete the risk assessment study. The
numerical values of these costs are summarized in the last column
of Table 12.14, and are described below. Replacement costs
for specific hazard events are difficult to determine and,
more importantly, these costs are not reflective of the actual
cost incurred by the building owner (unless he is
self-insured). Most frequently, the building owner insures the
building along with its contents and occupants with an insurance
company to which he pays annual premiums (say, I_BH for civil
construction and I_OH for occupants) whether or not a hazard
occurs (see Table 12.14). Hence, the additional cost faced by
the building owner when civil and/or direct physical hazards
do occur is actually the insurance deductible I_d. On the other
hand, financial risks due to accidental MEP and cybernetic
hazards (C_MEP and C_Cyb) are considered to be direct expenses
which the owner incurs whenever these occur. The monetary
consequence of risk due to an unanticipated increase in
operation services (maintenance, utility costs, ...) is taken to be
C_O&M, which can be assumed to be a certain percentage of the
total building cost that is spent yearly on operations,
maintenance and utility costs. Most of the above costs need to be
acquired from the concerned stakeholder.

Table 12.14 Building specific financial data assumed in the
case study from the perspective of the building owner

Description                             Symbol  Assumed values        Calculated values
--------------------------------------  ------  --------------------  ---------------------
Building initial (or replacement) cost  C_I     $ 15,000,000          -
Net return on investment                ROI     15% per year          $ 2,250,000/year
Number of occupants                     N_p     500                   -
Total amount of insurance coverage      C_Law   $ 10,000,000          -
  against occupant lawsuits
Occupant hazard insurance premium       I_OH    $ 200/occupant/year   $ 100,000/year
Building hazard insurance premium       I_BH    (2% * C_I) per year   $ 300,000/year
Insurance deductible                    I_d
  - for building                                (5% * C_I)            $ 750,000 (bldg)
  - for occupants                               (5% * C_Law)          $ 500,000 (occupants)
Annual building maintenance and         C_O&M   (5% * C_I) per year   $ 750,000/year
  utility cost
Replacement cost of MEP equipment       C_MEP   $ 3,000,000 per year  -
Cost to recover from computer           C_Cyb   $ 50,000              -
  software failure

Note that the current methodology overlooks variability or uncertainty in these data.
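The calculated column of Table 12.14 follows from the assumed inputs by simple arithmetic. The short sketch below reproduces those figures; the variable names are illustrative choices rather than the symbols used in the study.

```python
# Assumed financial inputs from Table 12.14 (variable names are illustrative).
building_cost = 15_000_000   # initial (or replacement) cost of the building, $
law_coverage = 10_000_000    # insurance coverage against occupant lawsuits, $
n_occupants = 500

derived = {
    "net_return": 0.15 * building_cost,           # 15% ROI, $/year
    "occupant_premium": 200 * n_occupants,        # $200 per occupant, $/year
    "building_premium": 0.02 * building_cost,     # 2% of building cost, $/year
    "deductible_building": 0.05 * building_cost,  # 5% of building cost, $
    "deductible_occupants": 0.05 * law_coverage,  # 5% of lawsuit coverage, $
    "maintenance_utility": 0.05 * building_cost,  # 5% of building cost, $/year
}

for name, value in derived.items():
    print(f"{name}: ${value:,.0f}")
```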
(g) Computational Methodology 

Recall that a decision tree is a graphical representation of all
the choices and possible outcomes with probabilities assigned
according to existing factual information or professional
input. Figure 12.14 shows the stakeholder's importance towards
various targets (characterized by the IM1 matrix) and
sub-targets (characterized by the IM2 matrix) shown in Table 12.12. Next,
these sub-targets are mapped onto the hazard categories using
the binary information contained in the Applicability Matrix
(AM). Finally, each of the affected building elements and the
associated hazard categories are made to branch out into their
corresponding hazard events with the associated absolute
probability values. Once a probability is assigned to a specified
risk event, a monetary value also needs to be associated with
it which is representative of the cost of undoing or repairing
the consequences of that event. Multiplying the monetary
value by the probability for each hazard event gives the
expected monetary value of risk of that event.
Adding all these values gives an overall characterization
of the building. Areas of high monetary risk can thus be
easily identified and targeted for improvement.
Mathematically, the above process can be summarized as follows:

Expected annual monetary risk specific to building element

\text{Risk} = C_h \sum_k \sum_m \sum_i \sum_j \left( IM1_k \cdot IM2_{k,m} \cdot AM_{m,i} \cdot p_{ij} \right)    (12.8)

where C_h is the cost associated with the building element
(Table 12.13), i the hazard category, j the hazard event, k the
target and m the sub-target.
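The summation in Eq. 12.8 can be sketched in a few lines of code. The function below is a minimal illustration under stated assumptions: it reads the equation as a four-fold sum over targets k, sub-targets m, hazard categories i and hazard events j, it uses crisp (non-fuzzy) weights, and all container layouts and names are hypothetical. In the toy inputs, the weights 0.1 and 0.9 and the two event probabilities echo values quoted in the text, while the applicability entries are invented purely for demonstration.

```python
def expected_annual_risk(cost, im1, im2, am, p):
    """Evaluate cost * sum over k, m, i, j of IM1[k] * IM2[k][m] * AM[m][i] * p[i][j].

    cost: monetary consequence associated with the building element ($)
    im1:  {target k: importance weight}
    im2:  {target k: {sub-target m: importance weight}}
    am:   {sub-target m: {hazard category i: 0 or 1}}
    p:    {hazard category i: {hazard event j: absolute annual probability}}
    """
    total = 0.0
    for k, w_target in im1.items():                # targets
        for m, w_sub in im2[k].items():            # sub-targets of target k
            for i, events in p.items():            # hazard categories
                if am[m].get(i, 0) == 1:           # skip non-applicable cells
                    for p_ij in events.values():   # hazard events
                        total += w_target * w_sub * p_ij
    return cost * total


# Toy inputs (applicability entries are hypothetical).
im1 = {"occupants": 0.1}
im2 = {"occupants": {"short-term": 0.9, "long-term": 0.1}}
am = {"short-term": {"natural": 1}, "long-term": {"natural": 0}}
p = {"natural": {"hurricane": 0.005, "flood": 0.01}}

risk = expected_annual_risk(500_000, im1, im2, am, p)
print(f"expected annual risk: ${risk:,.2f}")
```

Only the short-term branch contributes here, so the result is 500,000 x 0.1 x 0.9 x 0.015. Propagating the fuzzy spreads would require replacing the plain multiplications by fuzzy ones.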

Multiplication rules for fuzzy numbers are relatively simple
and are described in several texts (for example, McNeill
and Thro 1994). This is required for considering the
propagation of uncertainty through the various sequential stages
of the computation. Thus, in conjunction with computing the
estimate of a risk, a range of numbers can also be computed
reflective of the perceived importance to the stakeholder
vis-a-vis specific targets and sub-targets.
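To make the idea of propagating fuzzy uncertainty concrete, the sketch below multiplies two symmetric triangular fuzzy numbers using a first-order approximation that keeps the product symmetric. This is one common simplification and is an assumption here; it is not necessarily the rule given by McNeill and Thro (1994) or the one used in the case study. The class name and the spreads of 0.05 are hypothetical; the centers 0.1 and 0.9 are the occupant and short-term weights quoted earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriFuzzy:
    """Symmetric triangular fuzzy number: peak at center, support center +/- spread."""
    center: float
    spread: float

    def __mul__(self, other):
        # First-order approximation: the second-order term
        # self.spread * other.spread is dropped so the product stays symmetric.
        return TriFuzzy(
            self.center * other.center,
            abs(self.center) * other.spread + abs(other.center) * self.spread,
        )

    def bounds(self):
        """Return the lower and upper limits of the support."""
        return self.center - self.spread, self.center + self.spread


occupants = TriFuzzy(0.1, 0.05)    # IM1 weight; the spread is hypothetical
short_term = TriFuzzy(0.9, 0.05)   # IM2 weight; the spread is hypothetical
combined = occupants * short_term
print(combined, combined.bounds())
```

Chaining such products along each branch of the tree yields a mean value together with lower and upper limits for the monetary risk.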
(h) Illustrative example 

A hypothetical solved example is presented to illustrate
the entire methodology. A commercial building with 500
occupants has been assumed with the numerical values for
AM, IM1, IM2, and p_ij shown in Table 12.12, Fig. 12.14 and
Table 12.13. Financial inputs and assumptions are shown in
Table 12.14. The replacement cost of the building (C_I) is
assumed to be $ 15,000,000 with the ROI for the building
owner to be 15% per year. The gross annual income is assumed
to be $ 3,400,000 per year, while annual expenditures include
$ 300,000 (or 2% of C_I) as building hazard insurance premium,
$ 100,000 per year (or $ 200 per occupant) as occupant
insurance premium and $ 750,000 (or 5% of C_I) as building
maintenance and utility costs. MEP replacement cost is
estimated to be $ 3,000,000, and cost to recover from a complete
computer software failure is estimated to be $ 50,000. The
insurance deductibles for civil structure and occupants are
5% of C_I and C_Law (where C_Law is the total insurance coverage
against occupant lawsuits), namely $ 750,000 and $ 500,000
respectively.
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The results of the risk assessment are shown in Table 12.15 
at the whole building level. The building owner spends
$ 1,150,000 per year, with a net income of $ 2,250,000. The
mean monetary value of the total risks is $ 68,285, i.e., about
3.0% of his net income. He may decide that this is within
his tolerance threshold, and do nothing. On the other hand,
he may calculate the risk based on the upper limit value of
$ 102,415 shown in Table 12.15, and decide that 4.6% is
excessive. In this case, he would want some direction as to how
to manage the risk. This would require information about
individual outcomes of various events. Figure 12.16 provides
further specifics about individual hazard categories while
Fig. 12.17 shows details of the largest hazard (namely
"Direct Physical Hazard"). One notes that the "Sick Building"
hazard is prone to the highest risk followed by "Intentional"
attacks. Thus, the subsequent step involving determining the
specific mitigation measures to adopt (such as paying for
additional security personnel, improving IAQ by implementing
proper equipment, ...) would fall under the purview of
decision-making.

Table 12.15 Monetary risks on the five building elements affected by
various hazards considered ($/year)

Building element  Mean       Lower limit  Upper limit
----------------  ---------  -----------  -----------
Civil             $ 17,895   $ 10,170     $ 25,620
Direct Physical   $ 24,260   $ 13,380     $ 35,140
Cybernetic        $ 2,190    $ 1,410      $ 2,970
M/E Failure       $ 15,840   $ 6,270      $ 25,410
Operations        $ 8,100    $ 2,925      $ 13,275
Total             $ 68,285   $ 34,155     $ 102,415

Fig. 12.16 Risk assessment results at the Hazard Category level along with uncertainty bands

Fig. 12.17 Risk assessment results at the Hazard Event level specific to the Direct Physical Hazard category along with uncertainty bands
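The totals in Table 12.15 and the percentages quoted above can be checked with a few lines of arithmetic. The sketch below does so; the dictionary layout and variable names are illustrative.

```python
# Mean, lower-limit and upper-limit monetary risks ($/year) from Table 12.15.
RISKS = {
    "Civil": (17_895, 10_170, 25_620),
    "Direct Physical": (24_260, 13_380, 35_140),
    "Cybernetic": (2_190, 1_410, 2_970),
    "M/E Failure": (15_840, 6_270, 25_410),
    "Operations": (8_100, 2_925, 13_275),
}
NET_INCOME = 2_250_000  # net annual income of the building owner, $/year

# Column-wise sums over the five building elements.
mean_total, lower_total, upper_total = (sum(col) for col in zip(*RISKS.values()))

print(mean_total, lower_total, upper_total)
print(f"mean risk as share of net income:  {100 * mean_total / NET_INCOME:.1f}%")
print(f"upper risk as share of net income: {100 * upper_total / NET_INCOME:.1f}%")
```

The column sums agree with the Total row of the table, and the two shares round to the 3.0% and 4.6% quoted in the text.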



12.4.2 Decision Making While Operating
an Engineering System

Supervisory control of engineering systems involves opti- 
mization which relies on component models of individual 
equipment. This case study describes a decision-making si- 
tuation involving an engineering problem where one needs 
to operate a hybrid cooling plant consisting of three chillers 
of equal capacity (two vapor compression chillers, VC, and
one absorption chiller, AC), along with auxiliary equipment
such as cooling towers, fans and pumps.* The cooling plant is
meant to meet the loads of a building over a planning horizon 
covering a period of 12 h. This is a typical case of dynamic 
programming (addressed in Sect. 7.10). The cooling loads of 
the building are estimated at the start of the planning horizon, 
and whatever combination of equipment deemed to result in 
minimum operating costs is brought on-line or switched off. 
The optimization needs to consider chiller performance de- 
gradation under part load operation as well as the fact that 
starting chillers has an associated energy penalty (a certain 
start-up period is required to reach full operational status). 
However, there is a risk that the building loads are impro- 
perly estimated which could lead to a situation where there 
is inadequate cooling capacity, which also has an associated 
penalty. Minimizing the operating cost and minimizing the 
risk associated with loss of cooling (i.e. inadequate cooling 
capacity) are two separate issues altogether; how to trade-off 
between these objectives while considering the risk attitude 
of the operator is basically a decision analysis problem. The 
structure of the engineering problem is fairly well known 
with relatively low stochasticity and ambiguity; this is why 
this problem would fall into the low epistemic and low
aleatory category.

(a) Modeling uncertainties 

There are different uncertainties which play a role, and they
do so in a nested manner. A suitable classification of
different types of uncertainties is based on the nature of the
source of uncertainty, described at more length below:
(i) Model-Inherent Uncertainty which includes
uncertainties of the various component models used for dynamic
optimization that arise from inaccurate or incomplete
data and/or lack of perfect regression to the response
model. Equipment model uncertainties (assumed
additive) can be dealt with by including an error term in the
model. A general form of a multivariable ordinary
least-squares (OLS) regression model with one response
variable y and p regressor variables x can be expressed as:

y_{(n,1)} = X_{(n,p)} \beta_{(p,1)} + \varepsilon_{(n,1)}(0, \sigma^2)    (12.9)

where the subscripts indicate the number of rows and
columns of the vectors or matrices. The random error
term \varepsilon is assumed to have a normal distribution with
variance \sigma^2 and no bias (mean is zero). Since the error
terms of the different models are assumed to be independent,
their variances can be assumed to be additive. The
response variable is the energy consumption, namely
electricity for the VC and associated components and
gas for the AC. The model uncertainties expressed as
CV-RMSE of the regression models for the total
electricity and gas use (P_elec and P_gas) for each of the five
different combinations or groups (G1-G5) of equipment
operation are shown in Table 12.16.

(ii) Process-Inherent Uncertainty, which arises because control
policies are usually implemented by lower-level feedback
controllers whose set point tracking response will generally
be imperfect due to actuator constraints and unmodeled
time-varying behavior, nonlinearities, and disturbances.
The control implementation uncertainty on the chilled
water temperature and the fan speed can be
represented by upper and lower bounds. Based on the typical
actuator constraints for these two variables, the
accuracy of the temperature control is often in the range of
0.5-1.5°C, while accuracy of fan speed control would
be 2-5%.

(iii) External Prediction Uncertainty which includes the 
errors in predicting the system driving functions such 
as the building thermal load profiles, and electricity 
rate profiles. Load prediction uncertainty can be trea- 
ted either as unbiased, uncorrelated Gaussian noise or 
as unbiased correlated noise (only the results of the 
former are presented here). The distributed deviations 
around the true value are assumed such that their vari- 
ance grows linearly with the look-ahead horizon. This 



Table 12.16 Summary of CV-RMSE values for the component regression models for the hybrid cooling plant

Group   Description of          CV-RMSE (%)
        operating equipment     Electricity use   Gas use
G1      One VC                  4.0               -
G2      One AC                  3.0               5.0
G3      Two VC                  5.7               -
G4      One VC and one AC       5.0               5.0
G5      All three chillers      5.7               5.0

Adapted from Jiang and Reddy (2007) and Jiang et al. (2007).
VC vapor compression chiller, AC absorption chiller
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corresponds to a quantity whose prediction performan- 
ce deteriorates linearly with time. Mathematically, this 
noise model can be expressed as 



y_{t,i} = x_{t+i} [1 + ε(0, σ_i^2)]    (12.10a)

σ_i^2 = (i/L) ρ^2    (12.10b)



where y_{t,i} is the output variable at hour t and look-ahead
hourly interval i; x_t is the input at hour t; L is the
planning horizon (in our case, L = 12 h) and ρ is a scalar
characterizing the noise magnitude.
The dynamic optimization involved determining the optimal
scheduling and control strategy which resulted in minimum
energy cost for a specified hourly building cooling load
profile over a 12 h period under a specified electric price
signal. The process-inherent uncertainty was combined with the
equipment uncertainties by using a simple multiplier. This
strategy was called optimal deterministic strategy (ODS).
Different noise levels were investigated, ρ = {0, 0.05, 0.10,
0.15, 0.2}, though in practice one would expect ρ ≈ 0.05. For
example, at the end of 12 h, the error bands represented by
ρ = 0.2 would cover 20% of the mean value. Figure 12.18



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Fig. 12.18 Load prediction versus planning horizon for different noise 
levels for uncorrelated Gaussian noise in a typical hot day 



illustrates the extent to which the load prediction error keeps 
increasing with the look-ahead hour. 
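
The noise model can be sketched numerically. The short Python example below is only an illustration under stated assumptions: it takes the noise variance to grow linearly with the look-ahead hour as σ_i^2 = (i/L) ρ^2 (a form consistent with the statement that ρ = 0.2 corresponds to error bands of 20% at the end of 12 h), and it uses a hypothetical flat load profile of 1000 units. The function name `noisy_load_forecast` is not from the original study.

```python
import random
import statistics

def noisy_load_forecast(true_load, rho, rng):
    """Perturb each hourly load x by (1 + e), with e ~ N(0, sigma_i^2).

    Assumed variance model: sigma_i^2 = (i / L) * rho^2, where i is the
    look-ahead hour and L = len(true_load) is the planning horizon.
    """
    horizon = len(true_load)
    forecast = []
    for i, x in enumerate(true_load, start=1):
        sigma_i = rho * (i / horizon) ** 0.5  # std. dev. at look-ahead hour i
        forecast.append(x * (1.0 + rng.gauss(0.0, sigma_i)))
    return forecast

rng = random.Random(0)
true_load = [1000.0] * 12  # hypothetical flat 12 h load profile
trials = [noisy_load_forecast(true_load, 0.2, rng) for _ in range(5000)]

# Relative spread of the forecast at the first and last look-ahead hours
cv_hour1 = statistics.pstdev([t[0] for t in trials]) / 1000.0
cv_hour12 = statistics.pstdev([t[11] for t in trials]) / 1000.0
print(round(cv_hour1, 3), round(cv_hour12, 3))
```

Under these assumptions the relative spread at hour 12 should approach ρ = 0.2, while at hour 1 it should be close to 0.2/sqrt(12), i.e. about 0.058.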

Latin Hypercube Monte Carlo (LHMC) simulation with
10,000 trials was then applied to the ODS operating
strategy under different diurnal conditions. The results are
summarized in Table 12.17, from which the following
observations can be made:
(i) As expected, the coefficient of variation of the
standard deviation (CV-STD) values of operating cost over
the 10,000 trials increase with increasing model error.
When the model error is doubled, the CV-STD values
are approximately doubled as well. Also, as expected,
CV-STD values of the operating cost increase with
increasing load prediction uncertainty.
(ii) The model-inherent uncertainty is relatively more
important than the load prediction uncertainty. For example,
if the uncertainty profile is changed from (ε_m, 0.05)
to (ε_{m,2}, 0.05), the CV-STD of the minimal plant
operating cost is increased from 1.7 to 3.2% (Table 12.17).
On the other hand, if the load prediction error is doubled
while the model uncertainty is kept constant, the
CV-STD of operating cost is only increased from 1.7 to
2.1%. However, the model uncertainty has less effect
on the probability of loss of cooling (PLC) capability than
the load prediction error.
(iii) Under a practical uncertainty condition of (ε_m, 0.05),
model-inherent uncertainty and load prediction uncertainty
seem to have little effect on the overall operating
cost of the hybrid cooling plant under the optimal operating
strategy, with the CV-STD being around 2%.
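
The mechanics of Latin hypercube sampling and of the CV-STD metric can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration only: the cost model (a nominal cost multiplied by two independent relative error terms), the error magnitudes of 1.6% and 0.6%, and the function name `latin_hypercube_normal`. It is not the plant model used in the case study.

```python
import random
from statistics import NormalDist, mean, pstdev

def latin_hypercube_normal(n, mu, sigma, rng):
    """Draw n Latin hypercube samples from N(mu, sigma^2).

    The unit interval is split into n equal-probability strata, one
    uniform draw is taken inside each stratum, the draws are shuffled,
    and each is mapped through the inverse normal CDF.
    """
    dist = NormalDist(mu, sigma)
    probs = [(k + rng.random()) / n for k in range(n)]
    rng.shuffle(probs)
    eps = 1e-12  # keep probabilities strictly inside (0, 1) for inv_cdf
    return [dist.inv_cdf(min(max(p, eps), 1.0 - eps)) for p in probs]

rng = random.Random(42)
n_trials = 10_000
nominal_cost = 2268.0  # illustrative nominal operating cost in $

# Hypothetical relative errors (assumed magnitudes, not from the study)
model_err = latin_hypercube_normal(n_trials, 0.0, 0.016, rng)
load_err = latin_hypercube_normal(n_trials, 0.0, 0.006, rng)

costs = [nominal_cost * (1 + em) * (1 + el)
         for em, el in zip(model_err, load_err)]
ev = mean(costs)
cv_std = pstdev(costs) / ev  # CV-STD: standard deviation divided by EV
print(f"EV = {ev:.0f} $, CV-STD = {100 * cv_std:.2f} %")
```

Because each input is stratified, the sample mean and spread of every input are reproduced closely even with a modest number of trials, which is the usual reason for preferring Latin hypercube sampling over simple random sampling.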
(b) Decision analysis with multi-attribute utility frame- 
work 
The above sensitivity analysis revealed that different un- 
certainty scenarios have different expected values (EV) and 
CV-STD values for the cost of operating the hybrid cooling 
plant. The risk due to loss of cooling capability is also linked 
with the uncertainty in the building cooling load prediction. 
The probability of loss of cooling capability is determined 
under each specified uncertainty scenario and diurnal con- 
dition, with LHMC simulation used to generate numerous 
trials of the building cooling load profile. Subsequently, 
for each trial, one can again generate numerous cases using 






Table 12.17 Sensitivity analysis results under ODS (optimal deterministic strategy) assuming uncorrelated Gaussian noise and with 10,000 Monte Carlo simulation trials

(ε_m, ε_l)          Operating cost of plant             PLC (%)
                    EV ($)    STD ($)   CV-STD (%)
(0, 0)              2,268     -         -               -
(ε_m, 0)            2,267     35.4      1.6             -
(ε_m, 0.05)         2,268     39.1      1.7             -
(ε_m, 0.10)         2,269     47.4      2.1             0.17
(ε_m, 0.15)         2,271     59.3      2.6             1.9
(ε_m, 0.2)          2,274     70.8      3.1             4.5
(ε_{m,2}, 0)        2,267     70.7      3.1             -
(ε_{m,2}, 0.05)     2,268     73.3      3.2             -
(ε_{m,2}, 0.2)      2,273     93.2      4.1             4.6

Note: (ε_m, ε_l) describes the uncertainty profile; ε_m is the actual model uncertainty in the hybrid cooling plant, while ε_{m,2} implies that the uncertainty of the model and the process control combined is two times the actual model uncertainty; ε_l is the load prediction uncertainty with values being assumed to be 5, 10, 15 and 20%. EV is the expected value of operating cost, STD is the standard deviation of operating cost and PLC is the probability of loss of cooling
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Table 12.18 Sensitivity analysis results under RAS (risk averse 
strategy) assuming uncorrelated Gaussian noise and with 10,000 Monte Carlo simulation trials

(ε_m, ε_l)          Operating cost of plant             PLC (%)
                    EV ($)    STD ($)   CV-STD (%)
(0, 0)              2,919     -         -               -
(ε_m, 0)            2,922     37.6      1.3             -
(ε_m, 0.05)         2,922     39.6      1.4             -
(ε_m, 0.10)         2,923     45.3      1.6             -
(ε_m, 0.15)         2,923     53.4      1.8             -
(ε_m, 0.2)          2,924     62.9      2.2             -
(ε_{m,2}, 0)        2,922     75.2      2.6             -
(ε_{m,2}, 0.05)     2,922     76.2      2.6             -
(ε_{m,2}, 0.2)      2,924     90.6      3.1             -

LHMC to simulate the effect of component model errors on 
the energy consumption of the chillers under ODS. This nes- 
ted uncertainty computation thus captures the effect of both 
sources of uncertainty, namely that of load prediction and 
due to the component models plus the process control uncer- 
tainty. The total number of occurrences of loss of cooling ca- 
pability is divided by the total number of simulation trials to 
yield the probability of the loss of cooling capability (PLC) 
under specified uncertainty scenario and optimal operating 
strategy. PLC values are also shown in Table 12.17. 
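
The nested computation of PLC can be sketched as follows. This is a simplified illustration, not the procedure of the case study: plain Monte Carlo sampling is used instead of LHMC for brevity, a loss of cooling is assumed to occur whenever the peak hourly load exceeds the available plant capacity, the load noise is assumed to follow σ_i^2 = (i/L) ρ^2, and the load profile, capacity and capacity uncertainty are hypothetical numbers.

```python
import random

def probability_of_loss_of_cooling(load_profile, capacity, rho, capacity_cv,
                                   n_load_trials, n_model_trials, rng):
    """Nested Monte Carlo estimate of the probability of loss of cooling.

    Outer loop: noisy realizations of the load profile.
    Inner loop: realizations of the available capacity (a stand-in for
    component model plus process control uncertainty).
    """
    horizon = len(load_profile)
    losses = 0
    for _ in range(n_load_trials):
        peak = max(
            x * (1.0 + rng.gauss(0.0, rho * (i / horizon) ** 0.5))
            for i, x in enumerate(load_profile, start=1)
        )
        for _ in range(n_model_trials):
            available = capacity * (1.0 + rng.gauss(0.0, capacity_cv))
            if peak > available:  # assumed definition of loss of cooling
                losses += 1
    # occurrences of loss of cooling divided by total number of trials
    return losses / (n_load_trials * n_model_trials)

rng = random.Random(1)
profile = [800.0] * 12  # hypothetical hourly loads
plc_by_rho = {
    rho: probability_of_loss_of_cooling(profile, 1000.0, rho, 0.02, 500, 20, rng)
    for rho in (0.0, 0.05, 0.2)
}
for rho, plc in plc_by_rho.items():
    print(f"rho = {rho:.2f}: PLC = {100 * plc:.2f} %")
```

With these illustrative numbers the estimated PLC is zero without load noise and rises with ρ, which mirrors the qualitative trend of the PLC values in Table 12.17.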
(c) Risk-averse strategy 

Certain circumstances warrant operating the cooling plant 
differently than that suggested by the optimal deterministic 
strategy (ODS). For example, the activities inside the building
may be so critical that they require that the cooling load always
be met (such as in a pharmaceutical company).
Hence, one would like to have excess cooling capability at 
all times even if this results in extra operating cost. One could 
identify different types of risk averse strategies (RAS). For 
example, one could bring an extra chiller online only when 
the reserve capacity is less than a certain amount, say 10%. 
Alternatively, one could adopt a strategy of always having a 
reserve chiller online; this is the strategy selected here. 

Sensitivity analysis results are summarized in Table 12.18. 
One notes that there is no loss of cooling under any of the un- 
certainty scenarios, but the downside is that operating cost is 
increased. In this decision analysis problem, three attributes 
are considered: EV, CV-STD of operating cost, and PLC. The 
CV-STD is a measure of uncertainty surrounding the EV. In a 
decision framework, higher uncertainty results in higher risk, 
and so this attribute needs to also figure in the utility
function. Since these three attributes do not affect or interact
with each other, it is reasonable to assume them to be additively
independent, which allows the use of additive multi-attribute
utility functions of the form (similar to Eq. 12.4):

U_{ODS}(EV, CV-STD, PLC) = k_{1,ODS} U_{1,ODS}(EV)
    + k_{2,ODS} U_{2,ODS}(CV-STD) + k_{3,ODS} U_{3,ODS}(PLC)    (12.11)

U_{RAS}(EV, CV-STD, PLC) = k_{1,RAS} U_{1,RAS}(EV)
    + k_{2,RAS} U_{2,RAS}(CV-STD) + k_{3,RAS} U_{3,RAS}(PLC)    (12.12)



Table 12.19 Utility values for EV, CV-STD and PLC under ODS and RAS strategies normalized such that they are between 0 and 1 representing the worst and best values respectively

                           Case 1     Case 2        Case 3        Case 4        Case 5
(ε_m, ε_l)                 (ε_m, 0)   (ε_m, 0.05)   (ε_m, 0.10)   (ε_m, 0.15)   (ε_m, 0.2)
ODS   U_{1,ODS}(EV)        1.00       1.00          1.00          1.00          1.00
      U_{2,ODS}(CV-STD)    0.88       0.84          0.68          0.48          0.28
      U_{3,ODS}(PLC)       1.00       1.00          0.96          0.58          0.00
RAS   U_{1,RAS}(EV)        0.00       0.00          0.00          0.00          0.00
      U_{2,RAS}(CV-STD)    1.00       0.88          0.64          0.32          0.00
      U_{3,RAS}(PLC)       1.00       1.00          1.00          1.00          1.00




where U is the utility value, EV is the expected value of
total operating cost over the planning horizon, CV-STD is the
coefficient of variation of the standard deviation of the total
operating cost from all the Monte Carlo trials, PLC is the
probability of loss of cooling capability, and k denotes the
three weights normalized such that:

k_{1,ODS} + k_{2,ODS} + k_{3,ODS} = 1 and k_{1,RAS} + k_{2,RAS} + k_{3,RAS} = 1.




The benefit of using the additive utility function has already
been described in Sect. 12.2.6. Further, risk neutrality, i.e., a
linear utility function, has been assumed. The terms U_1, U_2
and U_3 under both ODS and RAS are normalized by dividing
them by the difference between the worst and best values of the
three attributes respectively, so that these values are bounded
by 0 and 1 (see Eq. 12.3). The normalized utilities
of the three attributes at other levels ranging from worst to
best can be determined by linear interpolation. Table 12.19
lists some of the utility values of the three attributes under five
different combinations or cases of model and load prediction
uncertainties. Hence, knowing the weights k, one can calculate
the overall utility under different uncertainty scenarios.
However, it is difficult to assign specific values to the
weights since they are application specific. If occasional loss
of cooling can be tolerated, then k_3 can be either set to zero or
given a very low weight. A general heuristic guideline is that
the weight for expected value (k_1) be taken to be 2-4 times





Fig. 12.19 Indifference plots ... attribute weights k_1 and k_2. Thus, for ... of k_2 = 0.2, the ODS is superior ... if the operator deems U(EV) to ...
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greater than that of variabihty of the predicted operating cost 
(k_2) (Clemen and Reilly 2001).
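
The linear normalization described above can be checked against the tabulated utilities. The sketch below assumes that blank PLC cells in Table 12.17 stand for zero, that the best PLC value is 0%, and that the worst is 4.5% (case 5); the function name `linear_utility` is chosen here for illustration.

```python
def linear_utility(value, worst, best):
    """Risk-neutral (linear) utility: worst maps to 0 and best maps to 1."""
    return (worst - value) / (worst - best)

# PLC (%) under ODS for cases 1-5; blank table cells are read as 0 (assumption)
plc_ods = [0.0, 0.0, 0.17, 1.9, 4.5]
u3_ods = [round(linear_utility(p, worst=4.5, best=0.0), 2) for p in plc_ods]
print(u3_ods)  # [1.0, 1.0, 0.96, 0.58, 0.0]
```

Under these assumptions the result reproduces the U_{3,ODS}(PLC) row of Table 12.19.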

Indifference curves can be constructed to compare different
choices. An indifference curve is the locus of all points
(or alternatives) in the decision maker's evaluation space
among which he is indifferent. Usually, an indifference
curve is a graph showing combinations of two attributes
among which the decision-maker has no preference for one
combination over another. The values of the multi-attribute
utility functions U_{ODS}(EV, CV-STD, PLC) and
U_{RAS}(EV, CV-STD, PLC) can be calculated based on
the results of Tables 12.17 and 12.18. The model uncertainty
is found to have little effect on the probability of loss of
cooling capability, and is not considered in the following
decision analysis; fixed realistic values of ε_m shown in
Table 12.16 are assumed. All points on an indifference curve
have the same utility under both strategies, and hence the
curves separate the regions of preference between ODS and RAS.
These are shown in Fig. 12.19 for the five cases. These plots
are easily generated by equating the right-hand sides of
Eqs. 12.11 and 12.12 and inserting the appropriate utility
values from Table 12.19. Thus, for example, if k_2 is taken to
be 0.2 (a typical value), one can immediately conclude from the
figure that ODS is preferable to RAS for case 5 provided
k_1 > 0.375 (which at the threshold corresponds to
k_3 = 1 - 0.2 - 0.375 = 0.425). Typically the operator is likely
to consider the attribute EV to have an importance level
higher than this value. Thus, the threshold values provide a
convenient means of determining whether one strategy is clearly
preferred over the other, or whether precise estimates of the
attribute weights are required to select an operating strategy.
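
The threshold quoted above can be reproduced from the tabulated utilities. The sketch below assumes that the same weights apply to both strategies and that k_3 = 1 - k_1 - k_2; setting the ODS and RAS utilities equal then gives a linear equation in k_1. The function name `indifference_k1` is introduced here for illustration.

```python
# Utilities (U1 for EV, U2 for CV-STD, U3 for PLC) from Table 12.19, cases 1-5
ODS = {1: (1.00, 0.88, 1.00), 2: (1.00, 0.84, 1.00), 3: (1.00, 0.68, 0.96),
       4: (1.00, 0.48, 0.58), 5: (1.00, 0.28, 0.00)}
RAS = {1: (0.00, 1.00, 1.00), 2: (0.00, 0.88, 1.00), 3: (0.00, 0.64, 1.00),
       4: (0.00, 0.32, 1.00), 5: (0.00, 0.00, 1.00)}

def indifference_k1(case, k2):
    """Weight k1 at which ODS and RAS have equal overall utility.

    With k3 = 1 - k1 - k2 and d_j = U_j,ODS - U_j,RAS, the condition
    k1*d1 + k2*d2 + (1 - k1 - k2)*d3 = 0 is solved for k1.
    ODS is preferred when k1 exceeds the returned value.
    """
    d1, d2, d3 = (o - r for o, r in zip(ODS[case], RAS[case]))
    return -(d3 + k2 * (d2 - d3)) / (d1 - d3)

for case in sorted(ODS):
    print(f"case {case}: ODS preferred if k1 > {indifference_k1(case, 0.2):.3f}")
```

For case 5 and k_2 = 0.2 this gives a threshold of about 0.372, in line with the value of 0.375 read from the figure; for cases 1-3 the threshold is below 0.03, which is why ODS is clearly preferable there.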



Figure 12.19 indicates that for cases 1-3 (load prediction
error of 0-10%), ODS is clearly preferable, and that
it would, most likely, be so even for cases 4 and 5 (load
prediction errors of 15 and 20%). From Fig. 12.20, which
has been generated assuming k_2 = 0 (i.e. no weight being
given to variability of the operating cost), it is clear that
the utility curve is steeper as k_1 decreases. This means that
the load prediction error has a more profound effect on the
utility function when the expected value of operating cost
has a lower weight; this is consistent with the earlier
observation that the load prediction uncertainty affects the loss
of cooling capability. One notes that RAS is only preferred
under a limited set of conditions. While the exact location
in parameter space of the switchover point between the two
strategies may change from application to application, this




Fig. 12.20 Overall utility of ODS vs load prediction uncertainty (%) assuming k_2 = 0
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approach supports the idea that well characterized systems 
may be operated under ODS except when the load predic- 
tion uncertainty is higher than 15%. Of course, this result 
is specific to this illustrative case study and should not be 
taken to apply universally. 
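
The trend discussed for Fig. 12.20 can be reproduced approximately from Table 12.19. The sketch below assumes k_2 = 0, so that k_3 = 1 - k_1, and evaluates the overall ODS utility for three illustrative values of k_1; the mapping of cases 1-5 to load prediction uncertainties of 0, 5, 10, 15 and 20% follows the table.

```python
# ODS utilities from Table 12.19 (cases 1-5)
U1_ODS = [1.00, 1.00, 1.00, 1.00, 1.00]  # utility of EV
U3_ODS = [1.00, 1.00, 0.96, 0.58, 0.00]  # utility of PLC
LOAD_UNCERTAINTY = [0, 5, 10, 15, 20]    # load prediction uncertainty (%)

def ods_utility_curve(k1):
    """Overall ODS utility per case for k2 = 0, i.e. k3 = 1 - k1."""
    k3 = 1.0 - k1
    return [k1 * u1 + k3 * u3 for u1, u3 in zip(U1_ODS, U3_ODS)]

for k1 in (0.2, 0.5, 0.8):
    curve = ods_utility_curve(k1)
    points = ", ".join(f"{e}%: {u:.2f}" for e, u in zip(LOAD_UNCERTAINTY, curve))
    print(f"k1 = {k1}: {points}")
```

The drop in utility between 0 and 20% load prediction uncertainty equals 1 - k_1, so the curve becomes steeper as k_1 decreases, consistent with the discussion above.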



Problems 

Pr.12.1 Identify and briefly discuss at least two situations 
(one from everyday life and one from an engineering or en- 
vironmental field) which would qualify as problems falling 
under the following categories: 

(a) Low epistemic and low aleatory uncertainty

(b) Low epistemic and high aleatory uncertainty

(c) High epistemic and low aleatory 

(d) High epistemic and high aleatory 

Pr. 12.2 The owner of a commercial building is conside- 
ring improving the energy efficiency of his building systems 
using both energy management options (involving little or no 
cost) and retrofits to energy equipment. Prepare: 

(a) an influence diagram for this situation 

(b) a decision tree diagram (similar to Fig. 12.3) 

Pr. 12.3 An electric utility is being pressured by the public 
utility commission to increase its renewable energy portfolio 
(in this particular case, energy conservation does not quali- 
fy). The decision-maker in the utility charges his staff to pre- 
pare the following (note that there are different possibilities): 

(a) an influence diagram for this situation, 

(b) a decision tree diagram. 

Pr. 12.4 Compute the representative discretized values and 
the associated probabilities for the Gaussian distribution (gi- 
ven in Appendix A3) for: 

(a) a 3-point approximation, 

(b) a 5-point approximation. 

Pr. 12.5 Consider Example 12.2.4. 

(a) Rework the problem including the effect of increasing 
unit electricity cost (linear increase over 10 years of 3% 
per year). What is the corresponding probability level of 
moving at which both AC options break even, i.e., the 
indifference point? 

(b) Rework the problem but with an added complexity. It 
was stated in the text (Sect. 12.2.6a) that discounted 
cash flow analysis where all future cash flows are con- 
verted to present costs is an example of an additive uti- 
lity function. Risk of future payments is to be modeled 
as increasing linearly over time. The 1st year, the risk is 
low, say 2%, the 2nd year is 4% and so on. How would 
you modify the traditional cash flow analysis to include 
such a risk attitude? 



Pr. 12.6 Traditional versus green homes 
An environmentally friendly "green" house costs about 25% 
more to construct than a conventional home. Most green ho- 
mes can save 50% per year on energy expenses to heat and 
cool the dwelling. 

(a) Assume the following: 

- It is an all-electric home needing 8 MWh/year for 
heating and cooling, and 6 MWh/year for equipment 
in the house 

- The conventional home costs $ 250,000 and its life 
span is 30 years 

- The cost of electricity is 15 cents/kWh with a 3% annual
increase

- The "green" home has the same life and no additional
value at the end of 30 years can be expected

- The discount rate (adjusted for inflation) is 6% per 
year. 

(b) Is the "green" home a worthwhile investment purely 
from an economic point of view? 

(c) If the government imposes a carbon tax of $ 25/ton, 
how would you go about reevaluating the added benefit 
of the "green" home? Assume 170 lbs of carbon are re- 
leased per MWh of electricity generation (corresponds 
to California electric mix). 

(d) If the government wishes to incentivize such "green" 
construction, what amount of upfront rebate should be 
provided, if at all? 

Pr. 12.7 Risk analysis for building owner being sued for in- 
door air quality 

A building owner is being sued by his tenants for claimed 
health ailments due to improper biological filtration of the 
HVAC system. The suit is for $ 500,000. The owner can 
either settle for $ 100,000 or go to court. If he goes to court 
and loses, he has to pay the lawsuit amount of $ 450,000 plus 
the court fees of $ 50,000. If he wins, the plaintiffs will have 
to pay the court fees of $ 50,000. 

(a) Draw a decision tree diagram for this problem 

(b) Construct the payoff table 

(c) Calculate the cutoff probability of the owner winning 
the lawsuit at which both options are equal to him. 

Pr. 12.8' Plot the following utility functions and classify the
decision-maker's attitude towards risk:

(a) U(x) = (100 + 0.5x)/250,  -200 ≤ x ≤ 300

(b) U(x) = (x + 100)^{1/2}/20,  -100 ≤ x ≤ 300

(c) U(x) = (x^3 - 2x^2 - x + 10)/800,  0 ≤ x ≤ 10



' From McClave and Benson (1988) by © permission of Pearson Edu- 
cation. 
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Table 12.20 


Payoff table for Problem 12.9

              Chance events
Alternative   S_1 (p = 0.1)   S_2 (p = 0.3)   S_3 (p = 0.4)   S_4 (p = 0.2)
A#1           $2.5k           $0.5k           $0.9k           $2.9k
A#2           $3.0k           $3.4k           $0.1k           $2.4k
A#3           $2.2k           $1.1k           $1.8k           $1.3k
A#4           $0.2k           $3.4k           $3.7k           $0.7k




Table 12.21 Data table for Problem 12.11

Percentage level of     Probability p(L)   Annual savings in
design goal met (%)                        operating cost ($/year)
90                      0.25               3,470
70                      0.40               2,920
50                      0.25               2,310
30                      0.10               1,560




(d) U(x) = (0.1x^2 + 10x)/110,  0 ≤ x ≤ 10

Pr. 12.9 Consider payoff Table 12.20 with four alternatives 
and four chance events (or states of nature). 

(a) Calculate the alternative with the highest expected va- 
lue (EV). 

(b) Resolve the problem assuming the utility cost function 
is exponential with zero minimum monetary value and 
risk tolerance R=$ 2 k. 

Pr. 12.10 Monte Carlo analysis for evaluating risk for pro- 
perty retrofits 

The owner of a large office building is considering making 
major retrofits to his property with system upgrades. You 
will perform a Monte Carlo analysis with 5,000 trials to in- 
vestigate the risk involved in this decision. Assume the fol- 
lowing: 

- The capital investment is $ 2 M taken to be normally dis- 
tributed with 10% CV. 

- The annual additional revenue depends on three chance 
outcomes: 

$ 0.5 M under good conditions, probability p(Good) = 0.4
$ 0.3 M under average conditions, p(Average) = 0.3
$ 0.2 M under poor conditions, p(Poor) = 0.3

- The annual additional expense of operation and maintenance
is normally distributed with a mean of $ 0.3 M and a standard
deviation of $ 0.05 M.

- The useful life of the upgrade is 10 years with a normal 
distribution of 2 years standard deviation. 

- The effective discount rate is 6%. 

Pr. 12.11' Risk analysis for pursuing a new design of com- 
mercial air-conditioner 

A certain company which manufactures compressors for 
commercial air-conditioning equipment has developed a new 
design which is likely to be more efficient than the existing 
one. The new design has a higher initial expense of $ 8,000 
but the advantage of reduced operating costs. The life of both 
compressors is 6 years. The design team has not fully tested
the new design but has a preliminary indication of the
efficiency improvement based on certain limited tests.
The preliminary indication has some uncertainty
regarding the advantage of the new design, and this
is quantified by a discrete probability distribution with four 
discrete levels of the efficiency improvement goal as shown 
in the first column of Table 12.21. Note that the annual sa- 
vings are incremental values, i.e., the savings over those of 
the old compressor design. 

(a) Draw the influence diagram and the decision tree dia- 
gram for this situation 

(b) Calculate the present value and determine whether the 
new design is more economical 

(c) Resolve the problem assuming the utility cost function 
is exponential with zero minimum monetary value and 
risk tolerance R=$ 2,500 

(d) Compute the EVPI (expected value of perfect informa- 
tion) 

(e) Perform a Monte Carlo analysis (with 1,000 trials) assuming
random distributions with 10% CV for p(L = 90)
and p(L = 30), and 5% CV for p(L = 70). Note that p(L = 50)
can be determined from these, and that negative
probability values are inadmissible and should be set to
zero. Generate the PDF and the CDF and determine the
5% and the 95% percentile values of EV

(f) Repeat step (e) with 5,000 trials and see if there is a 
difference in your results 

(g) Repeat step (f) but now consider the fact that the life of 
both compressors is normally distributed with mean of 
6 years and standard deviation of 0.5 years. 

Pr. 12.12'' Monte Carlo analysis for deaths due to radiation 
exposure 

Human exposure to radiation is often measured in rems (ro- 
entgen-equivalent man) or millirem (mrem). The cancer risk 
caused by exposure to radiation is thought to be approxi- 
mately one fatal cancer per 8,000 person-rems of exposure 
(e.g., one cancer death if 8,000 people are exposed to one 
rem each, or 10,000 exposed to 0.8 rems each,...). Natural 
radioactivity in the environment is thought to result in an ex- 
posure of 130 mrem/year. 

(a) How many cancer deaths can be expected in the United
States per year as a result (assume a population of 300
million)?

(b) Reanalyze the situation while considering variability. 
The exposure risk of 8,000 person-rems is normally 



^ Adapted from Sullivan et al. (2009) by © permission of Pearson Edu- 
cation. 



' From Masters and Ela (2008) by © permission of Pearson Education. 
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distributed with 10% CV, while the natural environment 
radioactivity has a value of 20% CV. Use 10,000 tri- 
als for the Monte Carlo simulation. Calculate the mean 
number of deaths and the two standard deviation range. 
(c) The result of (b) could have been found without using 
Monte Carlo analysis. Using formulae presented in 
Sect. 2.3.3 related to functions of random variables, com- 
pute the number of deaths and the two standard deviation 
range and compare these with the results from (b). 

Pr. 12.13 Radon is a radioactive gas resulting from radio- 
active decay of uranium to be found in the ground. In certain 
parts of the country, its levels are elevated enough that when 
it seeps into the basements of homes, it puts the homeowners 
at an elevated risk of cancer. EPA has set a limit of 4 pCi/L 
(equivalent to 400 mrem/year) above which mitigation mea- 
sures have to be implemented. Assume the criterion of one 
fatal cancer death per 8,000 person-rems of exposure. 

(a) How many cancer deaths per 100,000 people exposed 
to 4 pCi/L can be expected? 

(b) Reanalyze the situation while considering variability. 
The exposure risk of 4pCi/L is a log-normal distribu- 
tion with 10% CV, while the natural environment ra- 
dioactivity is normally distributed with 20% CV. Use 
10,000 trials for the Monte Carlo simulation. Calculate 
the mean number of deaths and the 95% confidence in- 
tervals. 

(c) Can the result of (b) be found using formulae presented 
in Sect. 2.3.3 related to functions of random variables? 
Explain. 

Pr. 12.14 Monte Carlo analysis for buying an all-electric car 
A person is considering buying an all-electric car which he 
will use daily to commute to work. He has a system at home 
of recharging the batteries at night. The all-electric car has a 
range of 100 miles while the minimum round trip of a daily 
commute is 40 miles. However, due to traffic congestion, he 
sometimes takes a longer route which can be represented by 
a log-normal distribution of standard deviation 5 miles. Mo- 
reover, his work requires him to visit clients occasionally for 
which purpose he needs to use his personal car. Such extra 
trips have a lognormal distribution of mean of 20 miles and a 
standard deviation of 5 miles. 

(a) Perform a Monte Carlo simulation (using 10,000 trials) 
and estimate the 99% probability of his car battery run- 
ning dry before he gets home on days when he has to 
make extra trips and take the longer route. 

(b) Can you verify your result using an analytical appro- 
ach? If yes, do so and compare results. 

(c) If the potential buyer approaches you for advice, what 
types of additional analyses will you perform prior to 
making your suggestion? 



Pr. 12.15 Evaluating feed-in tariffs versus upfront rebates 
for PV systems 

One of the promising ways to promote solar energy is to 
provide incentives to homeowners to install solar PV panels 
on their roofs. The PV panels generate electricity depending 
on the collector area, the efficiency of the PV panels and 
the solar radiation falling on the collector (which depends 
both on the location and on the tilt and azimuth angles of 
the collector). This production would displace the need to 
build more power plants and reduce adverse effects of both 
climate change and health ailments of inhaling polluted air. 
Unfortunately, the cost of such electricity generated is yet
to reach grid-parity, i.e., the PV electricity costs more than
traditional options, and hence the need for the government to
provide incentives.

Two different types of financing mechanisms are common.
One is the feed-in tariff where the electricity generated
at-site is sold back to the electric grid at a rate higher than
what the customer is charged by the utility (this is common
in Europe). The other is the upfront rebate financial mechanism
where the homeowner (or the financier/installer) gets
a refund check (or a tax break) per Watt-peak (W_p) installed
(this is the model adopted in the U.S.). You are asked to
evaluate these two options using the techniques of
decision-making, and present your findings and recommendations
in the form of a report. Assume the following:

- The cost of PV installation is $ 7 per Watt-peak (W_p). This
is the convention of rating and specifying PV module
performance (under standard test conditions of 1 kW/m^2 and
25 °C cell temperature)

- However, the PV module operates at much lower radia- 
tion conditions throughout the year, and the conventio- 
nal manner of considering this effect is to state a capacity 
factor which depends on the type of system, the type of 
mounting and also on location. Assume a capacity factor 
of 20%, i.e., over the whole year (averaged over all 24 h/ 
day and 365 days/year), the PV panel will generate 20% 
of the rated value 

- Life of the PV system is 25 years with zero operation and 
maintenance costs 

- The average cost of conventional electricity is $ 0.015/ 
kWh with a 2% annual increase 

- The discount rate for money borrowed by the homeowner 
towards the PV installation is 4%. 
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A: Statistical Tables 



Table A.1 Binomial probability sums V b(x;n,p) 
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Table A.2 Cumulative probability values for the Poisson distribution 
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Table A.3 The standard nonnal distribution (cumulative z curve areas) 

Standard normal (z) curve 



Tabulated area 












^* J 














z* 


.00 


.01 


.02 


.03 

.0001 


.04 

.0001 


.05 

.0001 


.06 


.07 


.08 


.09 


-3^ 


.0001 


.0001 


.0001 


,0001 


.0001 


.0001 


.0000 


-3.7 


.0001 


.0001 


.0001 


.0001 


.0001 


.0001 


,0001 


.0001 


.0001 


.0001 


-3.6 


.0002 


.0002 


.0001 


.0001 


.0001 


.0001 


.0001 


.0001 


.0001 


.0001 


-3.5 


.0002 


.0002 


.0002 


,0002 


.0002 


.0002 


.0002 


.0002 


.0002 


.0002 


-3.4 


.0003 


.0003 


.0003 


.0003 


.0003 


.0003 


.0003 


.0003 


.0003 


.0002 


-33 


.0005- 


.0005 


.0005 


.0004 


.0004 


.0004 


.0004 


.0004 


.0004 


.0003 


-3.2 


.0007 


.0007 


.0006 


.0006 


.0006 


.0006 


,0006 


.0005 


,0005 


.0005 


-3.1 


.0010 


.0009 


.0009 


.0009 


.0008 


.0008 


.0008 


.0008 


.0007 


.0007 


-3.0 


.0013 


.0013 


.0013 


.0012 


.0012 


.0011 


.0011 


.0011 


.0010 


.0010 


-2.9 


.0019 


.0018 


.0018 


.0017 


.0016 


,0016 


,0015 


.0015 


.0014 


.0014 


-2.8 


.0026 


.0025 


.0024 


.0023 


.0023 


.0022 


,0021 


.0021 


.0020 


.0019 


-2.7 


.0035 


.0034 


.0033 


.0032 


.0031 


.0030 


.0029 


.0028 


.0027 


.0026 


-2.6 


.0047 


.0045 


.0044 


.0043 


.0041 


.0040 


.0039 


.0038 


.0037 


.0036 


-2.5 


.0062 


.0060 


.0059 


.0057 


.0055 


.0054 


,0052 


.0051 


.0049 


.0048 


-2.4 


.0082 


.0080 


.0078 


.0075 


.0073 


.0071 


.0069 


.0068 


.0066 


.0064 


-2J 


.0107 


.0104 


.0102 


.0099 


.0096 


.0094 


.0091 


.0089 


.0087 


.0084 


-2.2   .0139  .0136  .0132  .0129  .0125  .0122  .0119  .0116  .0113  .0110
-2.1   .0179  .0174  .0170  .0166  .0162  .0158  .0154  .0150  .0146  .0143
-2.0   .0228  .0222  .0217  .0212  .0207  .0202  .0197  .0192  .0188  .0183
-1.9   .0287  .0281  .0274  .0268  .0262  .0256  .0250  .0244  .0239  .0233
-1.8   .0359  .0351  .0344  .0336  .0329  .0322  .0314  .0307  .0301  .0294
-1.7   .0446  .0436  .0427  .0418  .0409  .0401  .0392  .0384  .0375  .0367
-1.6   .0548  .0537  .0526  .0516  .0505  .0495  .0485  .0475  .0465  .0455
-1.5   .0668  .0655  .0643  .0630  .0618  .0606  .0594  .0582  .0571  .0559
-1.4   .0808  .0793  .0778  .0764  .0749  .0735  .0721  .0708  .0694  .0681
-1.3   .0968  .0951  .0934  .0918  .0901  .0885  .0869  .0853  .0838  .0823
-1.2   .1151  .1131  .1112  .1093  .1075  .1056  .1038  .1020  .1003  .0985
-1.1   .1357  .1335  .1314  .1292  .1271  .1251  .1230  .1210  .1190  .1170
-1.0   .1587  .1562  .1539  .1515  .1492  .1469  .1446  .1423  .1401  .1379
-0.9   .1841  .1814  .1788  .1762  .1736  .1711  .1685  .1660  .1635  .1611
-0.8   .2119  .2090  .2061  .2033  .2005  .1977  .1949  .1922  .1894  .1867
-0.7   .2420  .2389  .2358  .2327  .2296  .2266  .2236  .2206  .2177  .2148
-0.6   .2743  .2709  .2676  .2643  .2611  .2578  .2546  .2514  .2483  .2451
-0.5   .3085  .3050  .3015  .2981  .2946  .2912  .2877  .2843  .2810  .2776
-0.4   .3446  .3409  .3372  .3336  .3300  .3264  .3228  .3192  .3156  .3121
-0.3   .3821  .3783  .3745  .3707  .3669  .3632  .3594  .3557  .3520  .3483
-0.2   .4207  .4168  .4129  .4090  .4052  .4013  .3974  .3936  .3897  .3859
-0.1   .4602  .4562  .4522  .4483  .4443  .4404  .4364  .4325  .4286  .4247
-0.0   .5000  .4960  .4920  .4880  .4840  .4801  .4761  .4721  .4681  .4641
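The tabulated areas can be checked numerically. The sketch below is only an illustration, assuming the usual layout in which the row label gives z to one decimal and the column gives the second decimal; the function name `std_normal_cdf` is hypothetical, and only Python's standard `math.erf` is used.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Area under the standard normal density to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Row -1.9, column .06 of the table corresponds to z = -1.96.
for z in (-2.20, -1.96, -1.00, 0.00):
    print(f"z = {z:5.2f}   cumulative area = {std_normal_cdf(z):.4f}")
```

Rounding the computed areas to four decimals should agree with the corresponding table entries.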
Table A.4 Critical values of the t-distribution for confidence and prediction intervals



[Figure: t density curves. For a two-sided interval the central area lies between the -t and +t critical values; for a one-sided interval the cumulative area lies to the left of the t critical value.]

Central area = confidence/prediction level for a two-sided interval.
Cumulative area = confidence/prediction level for a one-sided interval.

Degrees of   Two-sided:   80%      90%      95%      98%      99%     99.8%    99.9%
freedom      One-sided:   90%      95%     97.5%     99%     99.5%    99.9%   99.95%

     1                  3.078    6.314   12.706   31.821   63.657   318.31   636.62
     2                  1.886    2.920    4.303    6.965    9.925   22.326   31.598
     3                  1.638    2.353    3.182    4.541    5.841   10.213   12.924
     4                  1.533    2.132    2.776    3.747    4.604    7.173    8.610
     5                  1.476    2.015    2.571    3.365    4.032    5.893    6.869
     6                  1.440    1.943    2.447    3.143    3.707    5.208    5.959
     7                  1.415    1.895    2.365    2.998    3.499    4.785    5.408
     8                  1.397    1.860    2.306    2.896    3.355    4.501    5.041
     9                  1.383    1.833    2.262    2.821    3.250    4.297    4.781
    10                  1.372    1.812    2.228    2.764    3.169    4.144    4.587
    11                  1.363    1.796    2.201    2.718    3.106    4.025    4.437
    12                  1.356    1.782    2.179    2.681    3.055    3.930    4.318
    13                  1.350    1.771    2.160    2.650    3.012    3.852    4.221
    14                  1.345    1.761    2.145    2.624    2.977    3.787    4.140
    15                  1.341    1.753    2.131    2.602    2.947    3.733    4.073
    16                  1.337    1.746    2.120    2.583    2.921    3.686    4.015
    17                  1.333    1.740    2.110    2.567    2.898    3.646    3.965
    18                  1.330    1.734    2.101    2.552    2.878    3.610    3.922
    19                  1.328    1.729    2.093    2.539    2.861    3.579    3.883
    20                  1.325    1.725    2.086    2.528    2.845    3.552    3.850
    21                  1.323    1.721    2.080    2.518    2.831    3.527    3.819
    22                  1.321    1.717    2.074    2.508    2.819    3.505    3.792
    23                  1.319    1.714    2.069    2.500    2.807    3.485    3.767
    24                  1.318    1.711    2.064    2.492    2.797    3.467    3.745
    25                  1.316    1.708    2.060    2.485    2.787    3.450    3.725
    26                  1.315    1.706    2.056    2.479    2.779    3.435    3.707
    27                  1.314    1.703    2.052    2.473    2.771    3.421    3.690
    28                  1.313    1.701    2.048    2.467    2.763    3.408    3.674
    29                  1.311    1.699    2.045    2.462    2.756    3.396    3.659
    30                  1.310    1.697    2.042    2.457    2.750    3.385    3.646
    40                  1.303    1.684    2.021    2.423    2.704    3.307    3.551
    60                  1.296    1.671    2.000    2.390    2.660    3.232    3.460
   120                  1.289    1.658    1.980    2.358    2.617    3.160    3.373
    ∞                   1.282    1.645    1.960    2.326    2.576    3.090    3.291
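As an illustration of how a tabulated critical value is used, the sketch below builds a two-sided 95% confidence interval for a mean as (sample mean) ± t × s / √n with n − 1 degrees of freedom. The sample `readings`, the dictionary `T_CRIT_95` (a few values copied from the 95% two-sided column) and the function name are assumptions made for this example.

```python
import math
import statistics

# A few two-sided 95% critical values copied from Table A.4, keyed by
# degrees of freedom (hypothetical lookup structure for this sketch).
T_CRIT_95 = {5: 2.571, 9: 2.262, 10: 2.228, 20: 2.086, 30: 2.042}


def mean_confidence_interval(sample, t_table=T_CRIT_95):
    """Two-sided confidence interval for the mean of a small sample."""
    n = len(sample)
    t_crit = t_table[n - 1]  # degrees of freedom = n - 1
    mean = statistics.mean(sample)
    half_width = t_crit * statistics.stdev(sample) / math.sqrt(n)
    return mean - half_width, mean + half_width


readings = [10, 12, 11, 13, 9, 10, 12, 11, 10, 12]  # hypothetical data
low, high = mean_confidence_interval(readings)
print(f"95% confidence interval for the mean: ({low:.2f}, {high:.2f})")
```

A lookup for degrees of freedom not copied into the dictionary raises `KeyError`, which is deliberate: the sketch does not interpolate between table rows.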
Table A.7 Critical values of the correlation coefficient



The tabled entries represent the critical values of r, based on n pairs of observations, for testing the hypothesis ρ = 0 at the .05 and .01 significance levels.

                Two-tailed p      One-tailed p
   n     df      .05     .01       .05     .01

   4      2     .950    .990      .900    .980
   5      3     .878    .959      .805    .934
   6      4     .811    .917      .729    .882
   7      5     .754    .875      .669    .833
   8      6     .707    .834      .621    .789
   9      7     .666    .798      .582    .750
  10      8     .632    .765      .549    .715
  11      9     .602    .735      .521    .685
  12     10     .576    .708      .497    .658
  13     11     .553    .684      .476    .634
  14     12     .532    .661      .457    .612
  15     13     .514    .641      .441    .592
  16     14     .497    .623      .426    .574
  17     15     .482    .606      .412    .558
  18     16     .468    .590      .400    .542
  19     17     .456    .575      .389    .529
  20     18     .444    .561      .378    .515
  25     23     .396    .505      .337    .462
  30     28     .361    .463      .306    .423
  40     38     .312    .402      .264    .367
  60     58     .254    .330      .214    .300
 120    118     .179    .234      .151    .212



For unlisted values of n, the critical value of r is given by

$$ r = \frac{t_c}{\sqrt{t_c^2 + n - 2}} $$

where $t_c$ is the critical t value associated with a given significance level and has df = n - 2.
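The relation between the critical t value and the critical r can be sketched in a few lines. This is a minimal illustration, assuming the form r = t / sqrt(t² + n − 2) with df = n − 2; the function name `critical_r` is hypothetical, and the t values are the two-sided 95% entries for df = 2 and df = 10.

```python
import math


def critical_r(t_crit: float, n: int) -> float:
    """Critical correlation coefficient for n pairs, given the critical t with df = n - 2."""
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)


# n = 4 pairs (df = 2) and n = 12 pairs (df = 10), two-tailed .05 level.
print(f"n = 4:  r = {critical_r(4.303, 4):.3f}")
print(f"n = 12: r = {critical_r(2.228, 12):.3f}")
```

The same function can be applied to any n not listed in the table, provided the matching critical t value is available.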
Table A.9 Critical values for Dunnett's method



Two-sided comparisons

          k-1 = number of treatment means (excluding control)
 n-k      1      2      3      4      5      6      7      8      9

   5    2.57   3.03   3.29   3.48   3.62   3.73   3.82   3.90   3.97
   6    2.45   2.86   3.10   3.26   3.39   3.49   3.57   3.64   3.71
   7    2.36   2.75   2.97   3.12   3.24   3.33   3.41   3.47   3.53
   8    2.31   2.67   2.88   3.02   3.13   3.22   3.29   3.35   3.41
   9    2.26   2.61   2.81   2.95   3.05   3.14   3.20   3.26   3.32
  10    2.23   2.57   2.76   2.89   2.99   3.07   3.14   3.19   3.24
  11    2.20   2.53   2.72   2.84   2.94   3.02   3.08   3.14   3.19
  12    2.18   2.50   2.68   2.81   2.90   2.98   3.04   3.09   3.14
  13    2.16   2.48   2.65   2.78   2.87   2.94   3.00   3.06   3.10
  14    2.14   2.46   2.63   2.75   2.84   2.91   2.97   3.02   3.07
  15    2.13   2.44   2.61   2.73   2.82   2.89   2.95   3.00   3.04
  16    2.12   2.42   2.59   2.71   2.80   2.87   2.92   2.97   3.02
  17    2.11   2.41   2.58   2.69   2.78   2.85   2.90   2.95   3.00
  18    2.10   2.40   2.56   2.68   2.76   2.83   2.89   2.94   2.98
  19    2.09   2.39   2.55   2.66   2.75   2.81   2.87   2.92   2.96
  20    2.09   2.38   2.54   2.65   2.73   2.80   2.86   2.90   2.95
  24    2.06   2.35   2.51   2.61   2.70   2.76   2.81   2.86   2.90
  30    2.04   2.32   2.47   2.58   2.66   2.72   2.77   2.82   2.86
  40    2.02   2.29   2.44   2.54   2.62   2.68   2.73   2.77   2.81
  60    2.00   2.27   2.41   2.51   2.58   2.64   2.69   2.73   2.77
 120    1.98   2.24   2.38   2.47   2.55   2.60   2.65   2.69   2.73
   ∞    1.96   2.21   2.35   2.44   2.51   2.57   2.61   2.65   2.69



Reproduced from C. W. Dunnett, "New Tables for Multiple Comparison with a Control," Biometrics, Vol. 20, No. 3, 1964.
Table A.10 Critical values of Spearman's rank correlation coefficient

The α values correspond to a one-tailed test of H0: ρs = 0. The value should be doubled for two-tailed tests.

   n    α = .05   α = .025   α = .01   α = .005

   5     .900
   6     .829      .886       .943
   7     .714      .786       .893
   8     .643      .738       .833      .881
   9     .600      .683       .783      .833
  10     .564      .648       .745      .794
  11     .523      .623       .736      .818
  12     .497      .591       .703      .780
  13     .475      .566       .673      .745
  14     .457      .545       .646      .716
  15     .441      .525       .623      .689
  16     .425      .507       .601      .666
  17     .412      .490       .582      .645
  18     .399      .476       .564      .625
  19     .388      .462       .549      .608
  20     .377      .450       .534      .591
  21     .368      .438       .521      .576
  22     .359      .428       .508      .562
  23     .351      .418       .496      .549
  24     .343      .409       .485      .537
  25     .336      .400       .475      .526
  26     .329      .392       .465      .515
  27     .323      .385       .456      .505
  28     .317      .377       .448      .496
  29     .311      .370       .440      .487
  30     .305      .364       .432      .478



Source: From E. G. Olds, "Distribution of Sums of Squares of Rank Differences for Small Samples," Annals of Mathematical Statistics, 1938, 9.
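To show how the table is consulted, the following sketch computes Spearman's coefficient with the shortcut formula rs = 1 − 6 Σd² / (n(n² − 1)), which holds when there are no tied observations, and compares it with the one-tailed α = .05 entry for n = 9. The data and all function names are hypothetical and serve only as an example.

```python
def ranks(values):
    """Ranks starting at 1 for the smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rs(x, y):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical paired observations, n = 9.
x = [2.1, 3.4, 1.7, 5.0, 4.2, 6.3, 2.9, 3.8, 7.1]
y = [1.9, 3.0, 2.2, 4.6, 4.9, 5.8, 2.5, 3.9, 6.6]

rs = spearman_rs(x, y)
CRITICAL_RS = 0.600  # table entry for n = 9, one-tailed alpha = .05
print(f"rs = {rs:.3f}; exceeds critical value: {rs > CRITICAL_RS}")
```

With tied observations the shortcut formula no longer applies exactly, so average ranks and the Pearson formula on the ranks would be needed instead.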
Table A.11 Critical values of T_L and T_U for the Wilcoxon rank sum for independent samples

Test statistic is the rank sum associated with the smaller sample (if equal sample sizes, either rank sum 
can be used). 

a. α = .025 one-tailed; α = .05 two-tailed

 n2 \ n1       3          4          5          6          7          8          9          10
            T_L  T_U   T_L  T_U   T_L  T_U   T_L  T_U   T_L  T_U   T_L  T_U   T_L  T_U   T_L  T_U

    3         5   16     6   18     6   21     7   23     7   26     8   28     8   31     9   33
    4         6   18    11   25    12   28    12   32    13   35    14   38    15   41    16   44
    5         6   21    12   28    18   37    19   41    20   45    21   49    22   53    24   56
    6         7   23    12   32    19   41    26   52    28   56    29   61    31   65    32   70
    7         7   26    13   35    20   45    28   56    37   68    39   73    41   78    43   83
    8         8   28    14   38    21   49    29   61    39   73    49   87    51   93    54   98
    9         8   31    15   41    22   53    31   65    41   78    51   93    63  108    66  114
   10         9   33    16   44    24   56    32   70    43   83    54   98    66  114    79  131

b. α = .05 one-tailed; α = .10 two-tailed

 n2 \ n1       3          4          5          6          7          8          9          10
            T_L  T_U   T_L  T_U   T_L  T_U   T_L  T_U   T_L  T_U   T_L  T_U   T_L  T_U   T_L  T_U

    3         6   15     7   17     7   20     8   22     9   24     9   27    10   29    11   31
    4         7   17    12   24    13   27    14   30    15   33    16   36    17   39    18   42
    5         7   20    13   27    19   36    20   40    22   43    24   46    25   50    26   54
    6         8   22    14   30    20   40    28   50    30   54    32   58    33   63    35   67
    7         9   24    15   33    22   43    30   54    39   66    41   71    43   76    46   80
    8         9   27    16   36    24   46    32   58    41   71    52   84    54   90    57   95
    9        10   29    17   39    25   50    33   63    43   76    54   90    66  105    69  111
   10        11   31    18   42    26   54    35   67    46   80    57   95    69  111    83  127



Source: From F. Wilcoxon and R. A. Wilcox, "Some Rapid Approximate Statistical Procedures," 1964, 20-25.
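The test statistic described in the table heading (the rank sum of the smaller sample) can be sketched as follows. The samples are hypothetical, ties are assumed absent, and the decision rule shown (reject when T ≤ T_L or T ≥ T_U for a two-tailed test) is the conventional reading of such a table rather than something stated in the table itself.

```python
def rank_sum_of_smaller_sample(a, b):
    """Rank sum T of the smaller sample in the pooled ranking; assumes no ties."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    pooled = sorted(list(small) + list(large))
    rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}
    return sum(rank_of[value] for value in small)


# Hypothetical samples with sizes 4 and 5.
sample_a = [1.2, 2.3, 1.8, 2.0]
sample_b = [3.1, 2.7, 3.6, 2.9, 3.3]

T = rank_sum_of_smaller_sample(sample_a, sample_b)
T_L, T_U = 12, 28  # part a entries for sample sizes 4 and 5 (two-tailed .05)
reject = T <= T_L or T >= T_U
print(f"T = {T}; reject H0 at the two-tailed .05 level: {reject}")
```

For equal sample sizes either rank sum can be used, as the table heading notes; the function then simply takes the first argument.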
Table A.12 Critical values of T for the Wilcoxon paired difference signed rank test



One-tailed   Two-tailed    n = 5    n = 6    n = 7    n = 8    n = 9    n = 10
α = .05      α = .10          1        2        4        6        8       11
α = .025     α = .05                   1        2        4        6        8
α = .01      α = .02                            0        2        3        5
α = .005     α = .01                                     0        2        3

One-tailed   Two-tailed    n = 11   n = 12   n = 13   n = 14   n = 15   n = 16
α = .05      α = .10         14       17       21       26       30       36
α = .025     α = .05         11       14       17       21       25       30
α = .01      α = .02          7       10       13       16       20       24
α = .005     α = .01          5        7       10       13       16       19

One-tailed   Two-tailed    n = 17   n = 18   n = 19   n = 20   n = 21   n = 22
α = .05      α = .10         41       47       54       60       68       75
α = .025     α = .05         35       40       46       52       59       66
α = .01      α = .02         28       33       38       43       49       56
α = .005     α = .01         23       28       32       37       43       49

One-tailed   Two-tailed    n = 23   n = 24   n = 25   n = 26   n = 27   n = 28
α = .05      α = .10         83       92      101      110      120      130
α = .025     α = .05         73       81       90       98      107      117
α = .01      α = .02         62       69       77       85       93      102
α = .005     α = .01         55       61       68       76       84       92

One-tailed   Two-tailed    n = 29   n = 30   n = 31   n = 32   n = 33   n = 34
α = .05      α = .10        141      152      163      175      188      201
α = .025     α = .05        127      137      148      159      171      183
α = .01      α = .02        111      120      130      141      151      162
α = .005     α = .01        100      109      118      128      138      149

One-tailed   Two-tailed    n = 35   n = 36   n = 37   n = 38   n = 39
α = .05      α = .10        214      228      242      256      271
α = .025     α = .05        195      208      222      235      250
α = .01      α = .02        174      186      198      211      224
α = .005     α = .01        160      171      183      195      208

One-tailed   Two-tailed    n = 40   n = 41   n = 42   n = 43   n = 44   n = 45
α = .05      α = .10        287      303      319      336      353      371
α = .025     α = .05        264      279      295      311      327      344
α = .01      α = .02        238      252      267      281      297      313
α = .005     α = .01        221      234      248      262      277      292

One-tailed   Two-tailed    n = 46   n = 47   n = 48   n = 49   n = 50
α = .05      α = .10        389      408      427      446      466
α = .025     α = .05        361      379      397      415      434
α = .01      α = .02        329      345      362      380      398
α = .005     α = .01        307      323      339      356      373





Source: From F. Wilcoxon and R. A. Wilcox, "Some Rapid Approximate Statistical Procedures," 1964, p. 28.
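A minimal sketch of the signed rank statistic follows, assuming the usual construction: zero differences are dropped, the absolute differences are ranked, and T is the smaller of the positive and negative rank sums. The paired data and the function name are hypothetical, and tied absolute differences are assumed not to occur.

```python
def signed_rank_T(before, after):
    """Return (T, n): the smaller signed-rank sum and the number of nonzero differences."""
    diffs = [b - a for b, a in zip(before, after) if b != a]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    t_plus = 0
    t_minus = 0
    for rank, i in enumerate(order, start=1):
        if diffs[i] > 0:
            t_plus += rank
        else:
            t_minus += rank
    return min(t_plus, t_minus), len(diffs)


# Hypothetical paired measurements (n = 10 pairs, no zero differences).
before = [12.1, 11.4, 13.0, 10.8, 12.6, 11.9, 13.4, 12.2, 11.1, 12.9]
after = [11.6, 11.6, 12.1, 10.5, 11.5, 12.3, 12.7, 11.6, 11.0, 12.1]

T, n = signed_rank_T(before, after)
T_CRITICAL = 8  # table entry for n = 10, two-tailed alpha = .05
print(f"T = {T}, n = {n}; reject H0: {T <= T_CRITICAL}")
```

The null hypothesis is rejected when the computed T is less than or equal to the tabulated value for the number of nonzero differences.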
Table A.13 Critical values of the Durbin-Watson statistic for 5% significance level





           k = 1          k = 2          k = 3          k = 4          k = 5
   n     d_L    d_U     d_L    d_U     d_L    d_U     d_L    d_U     d_L    d_U

  15    1.08   1.36    0.95   1.54    0.82   1.75    0.69   1.97    0.56   2.21
  16    1.10   1.37    0.98   1.54    0.86   1.73    0.74   1.93    0.62   2.15
  17    1.13   1.38    1.02   1.54    0.90   1.71    0.78   1.90    0.67   2.10
  18    1.16   1.39    1.05   1.53    0.93   1.69    0.82   1.87    0.71   2.06
  19    1.18   1.40    1.08   1.53    0.97   1.68    0.86   1.85    0.75   2.02
  20    1.20   1.41    1.10   1.54    1.00   1.68    0.90   1.83    0.79   1.99
  21    1.22   1.42    1.13   1.54    1.03   1.67    0.93   1.81    0.83   1.96
  22    1.24   1.43    1.15   1.54    1.05   1.66    0.96   1.80    0.86   1.94
  23    1.26   1.44    1.17   1.54    1.08   1.66    0.99   1.79    0.90   1.92
  24    1.27   1.45    1.19   1.55    1.10   1.66    1.01   1.78    0.93   1.90
  25    1.29   1.45    1.21   1.55    1.12   1.66    1.04   1.77    0.95   1.89
  26    1.30   1.46    1.22   1.55    1.14   1.65    1.06   1.76    0.98   1.88
  27    1.32   1.47    1.24   1.56    1.16   1.65    1.08   1.76    1.01   1.86
  28    1.33   1.48    1.26   1.56    1.18   1.65    1.10   1.75    1.03   1.85
  29    1.34   1.48    1.27   1.56    1.20   1.65    1.12   1.74    1.05   1.84
  30    1.35   1.49    1.28   1.57    1.21   1.65    1.14   1.74    1.07   1.83
  31    1.36   1.50    1.30   1.57    1.23   1.65    1.16   1.74    1.09   1.83
  32    1.37   1.50    1.31   1.57    1.24   1.65    1.18   1.73    1.11   1.82
  33    1.38   1.51    1.32   1.58    1.26   1.65    1.19   1.73    1.13   1.81
  34    1.39   1.51    1.33   1.58    1.27   1.65    1.21   1.73    1.15   1.81
  35    1.40   1.52    1.34   1.58    1.28   1.65    1.22   1.73    1.16   1.80
  36    1.41   1.52    1.35   1.59    1.29   1.65    1.24   1.73    1.18   1.80
  37    1.42   1.53    1.36   1.59    1.31   1.66    1.25   1.72    1.19   1.80
  38    1.43   1.54    1.37   1.59    1.32   1.66    1.26   1.72    1.21   1.79
  39    1.43   1.54    1.38   1.60    1.33   1.66    1.27   1.72    1.22   1.79
  40    1.44   1.54    1.39   1.60    1.34   1.66    1.29   1.72    1.23   1.79
  45    1.48   1.57    1.43   1.62    1.38   1.67    1.34   1.72    1.29   1.78
  50    1.50   1.59    1.46   1.63    1.42   1.67    1.38   1.72    1.34   1.77
  55    1.53   1.60    1.49   1.64    1.45   1.68    1.41   1.72    1.38   1.77
  60    1.55   1.62    1.51   1.65    1.48   1.69    1.44   1.73    1.41   1.77
  65    1.57   1.63    1.54   1.66    1.50   1.70    1.47   1.73    1.44   1.77
  70    1.58   1.64    1.55   1.67    1.52   1.70    1.49   1.74    1.46   1.77
  75    1.60   1.65    1.57   1.68    1.54   1.71    1.51   1.74    1.49   1.77
  80    1.61   1.66    1.59   1.69    1.56   1.72    1.53   1.74    1.51   1.77
  85    1.62   1.67    1.60   1.70    1.57   1.72    1.55   1.75    1.52   1.77
  90    1.63   1.68    1.61   1.70    1.59   1.73    1.57   1.75    1.54   1.78
  95    1.64   1.69    1.62   1.71    1.60   1.73    1.58   1.75    1.56   1.78
 100    1.65   1.69    1.63   1.72    1.61   1.74    1.59   1.76    1.57   1.78



Source: J. Durbin and G. S. Watson, Biometrika, 38 (1951).
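A short sketch of how the bounds are typically applied: the statistic d = Σ(e_t − e_{t−1})² / Σ e_t² is computed from the regression residuals and compared with d_L and d_U when testing for positive first-order autocorrelation. The residuals and the function names are hypothetical; the bounds are the n = 15, k = 1 entries.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic from a sequence of residuals in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def dw_decision(d, d_lower, d_upper):
    """Bounds test for positive first-order autocorrelation."""
    if d < d_lower:
        return "evidence of positive autocorrelation"
    if d > d_upper:
        return "no evidence of positive autocorrelation"
    return "inconclusive"


# Hypothetical residuals from a model fitted to n = 15 observations.
residuals = [0.5, 0.6, 0.4, 0.3, 0.1, -0.2, -0.4, -0.5,
             -0.6, -0.3, -0.1, 0.2, 0.3, 0.4, -0.7]

d = durbin_watson(residuals)
print(f"d = {d:.3f}: {dw_decision(d, d_lower=1.08, d_upper=1.36)}")
```

Values of d near 2 suggest little autocorrelation, while values well below d_L, as in this made-up series, point to positively correlated residuals.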
B: Large Data Sets



Problem 2.25 Generating cumulative distribution curves and utilizability curves from measured data 

Table B 1 Measured global solar radiation values (all values are in MJ/m²) on a horizontal surface at Quezon City, Manila during October 1980. Taken from Reddy (1987)

Day                                                  Solar time
No.    06-07   07-08   08-09   09-10   10-11   11-12   12-13   13-14   14-15   15-16   16-17   17-18   Daily

  1    0.209   0.837   1.256   1.591   1.214   1.759   1.005   1.172   0.963   0.461   0.126   0.042   10.63
  2    0.126   0.335   0.712   1.172   1.465   1.130   0.796   0.712   0.544   0.544   0.293   0.084    7.91
  3    0.084   0.419   0.754   0.921   1.005   0.963   0.461   0.502   0.419   0.293   0.209   0.084    6.11
  4    0.126   0.754   2.010   1.968   2.303   2.177   2.094   1.800   1.926   0.251   0.461   0.209   16.08
  5    0.628   1.214   1.884   2.428   2.387   2.010   2.010   1.298   2.177   1.759   0.921   0.335   19.05
  6    0.209   0.754   2.387   2.596   2.931   3.057   2.722   2.805   1.759   1.214   0.712   0.042   21.19
  7    0.293   0.544   0.837   1.800   2.596   3.475   2.847   2.135   1.800   0.921   0.837   0.209   18.30
  8    0.419   1.382   2.219   2.010   2.010   2.219   2.094   2.177   1.800   0.837   0.335   0.126   17.63
  9    0.461   1.340   2.303   2.094   2.387   2.763   2.596   2.722   0.963   0.377   0.293   0.126   18.42
 10    0.335   0.502   1.800   3.057   2.847   1.633   2.135   0.377   1.675   1.298   0.712   0.126   16.50
 11    0.377   1.130   1.842   2.010   2.094   2.219   2.470   2.303   0.502   0.167   0.084   0.042   15.24
 12    0.209   0.670   0.754   0.754   0.963   1.172   2.973   2.805   1.465   0.419   0.126   0.000   12.31
 13    0.167   0.796   0.963   2.010   1.800   1.884   0.544   1.130   1.884   1.172   0.754   0.084   13.19
 14    0.335   0.419   1.675   1.591   2.094   2.219   2.345   2.177   1.424   0.335   0.084   0.000   14.70
 15    0.126   1.382   1.884   2.094   2.596   2.973   2.596   2.094   1.968   0.419   0.251   0.042   18.42
 16    0.419   0.754   1.549   2.847   2.470   2.428   0.251   0.126   0.209   0.419   0.419   0.126   12.02
 17    0.502   1.256   2.010   2.052   2.763   2.805   2.428   2.554   1.968   0.754   0.712   0.084   19.89
 18    0.126   0.377   1.633   2.345   2.303   1.256   1.130   0.628   0.586   0.544   0.419   0.126   11.47
 19    0.293   1.214   1.717   2.010   2.805   2.428   0.754   0.544   0.879   1.298   1.298   0.419   15.66
 20    0.126   0.419   0.377   0.293   0.209   0.293   0.335   0.837   1.256   0.754   0.335   0.084    5.32
 21    0.126   0.335   0.335   0.335   0.335   0.419   0.502   0.293   0.335   0.293   0.167   0.084    3.56
 22    0.084   0.084   0.126   0.461   0.419   1.172   2.387   1.130   0.628   0.126   0.042   0.000    6.66
 23    0.251   1.382   2.177   2.931   2.638   2.219   2.680   0.377   0.167   0.126   0.000   0.000   14.95
 24    0.126   0.335   1.172   2.345   2.303   1.800   2.554   1.424   1.675   0.251   0.209   0.126   14.32
 25    0.209   0.670   0.712   1.591   2.554   2.805   3.350   2.722   2.010   1.884   0.754   0.126   19.39
 26    0.377   1.130   2.094   2.219   2.596   2.596   2.303   2.052   1.968   1.172   0.544   0.126   19.18
 27    0.419   1.089   1.968   2.219   2.428   2.638   2.261   1.968   1.424   0.963   0.335   0.126   17.84
 28    0.293   0.544   0.837   2.177   0.879   0.419   0.502   0.544   0.754   0.335   0.084   0.042    7.41
 29    0.000   0.042   0.084   0.377   0.628   1.465   0.963   0.712   0.754   0.586   0.167   0.042    5.82
 30    0.335   1.130   1.591   1.759   2.596   2.428   0.712   2.010   1.884   1.633   0.796   0.126   17.00
 31    0.084   0.209   0.461   0.754   1.591   2.177   2.470   2.847   1.842   1.424   0.502   0.126   14.49

Mean   0.254   0.756   1.359   1.768   1.942   1.968   1.783   1.515   1.278   0.743   0.419   0.107   13.89



† All values are in MJ/m².
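A possible starting point for Problem 2.25 is sketched below, assuming the usual definitions: the cumulative distribution gives the fraction of observations at or below a given level, and the utilizability at a critical level is the fraction of the total radiation lying above that level, Σ max(H − H_c, 0) / Σ H. The function names and the critical level of 10 MJ/m² are hypothetical, and only the daily totals of the first seven days are used to keep the example short.

```python
def cumulative_fraction(values, level):
    """Fraction of observations less than or equal to the given level."""
    return sum(1 for v in values if v <= level) / len(values)


def utilizability(values, critical_level):
    """Fraction of total radiation above the critical level (assumed definition)."""
    excess = sum(max(v - critical_level, 0.0) for v in values)
    return excess / sum(values)


# Daily totals (MJ/m^2) for days 1 to 7 of Table B 1.
daily = [10.63, 7.91, 6.11, 16.08, 19.05, 21.19, 18.30]

for level in (5.0, 10.0, 15.0, 20.0):
    print(
        f"level = {level:5.1f}  "
        f"cumulative fraction = {cumulative_fraction(daily, level):.3f}  "
        f"utilizability = {utilizability(daily, level):.3f}"
    )
```

Evaluating both functions over a fine grid of levels, for the full month and for individual hours, would trace out the curves the problem asks for.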
Problem 4.15 Comparison of human comfort correlations between Caucasian and Chinese subjects



Table B 2 Experimental results of tests conducted with Chinese subjects (provided by Jiang, 2001). The values under the column P_v have been 
computed from psychrometric relations from T-db and RH values. 



Females only                                              Combined

T_db (°C)   RH   P_v (kPa)   PMV meas   PPD meas          T_db (°C)   RH   P_v (kPa)   PMV meas   PPD meas


20.3 


CiO 


1.154 


-1.4 


50 


203 


0.50 


1.194 


-1.35 


40 


21.5 


030 


0.776 


-1.1 


20 




0.50 


0.776 


-0.8 


18 


21.3 


0:50 


1277 


-12 


30 


23 


0.52 


1.277 


-0.8 


15 




OiO 


1354 


-0.86 


21.6 


15 


0.35 


1.354 


-0.22 


5.6 


21.8 


0.70 


1.346 


-1 


33.3 


25 


0.50 


1.S46 


0.042 





23 


035 


C.553 


-C.85 


21.4 


25.7 


0.54 


0.993 


0.22 





23 


0i2 


1.475 


^.8 


20 


27 


0.31 


1.475 


0.27 


42 


23 


0.76 


2.156 


-0.25 





26.4 


0.73 


2.156 


0.6 


18.2 


25 


035 


1.109 


-0.25 


12.5 


273 


0.74 


1.109 


1.59 


50 


25 


OJO 


1385 


-0.083 





213 


0.30 


1.585 


-0.86 


20 


24.7 


0.74 


2308 


0.2 





213 


0.50 


2308 


-0.91 


228 


25.7 


0J4 


1.":'93 








21.8 


0.70 


1.793 


-0.^5 


20.8 




CJl 


1.116 


C.2 


c 


23 


0.35 


1.116 


-C.~5 


16.7 


27.1 


033 


1.195 


0.33 


8.4 


23 


0.76 


1.195 


-0.25 





27.4 


0i4 


1590 


1,2 


40 


24.7 


0.74 


1,990 


0.18 





26.4 


0.73 


2334 


0.5 


16.7 


27.1 


0.33 


2.534 


033 


8.4 


27.3 


0.74 


2.712 


1.4 


40 


27.4 


0.54 


2.712 


1.05 


38.5 


28.6 


0.56 


2209 


1.7 


50 


2S.6 


0.56 


2209 


135 


544 


Males only

T_db (°C)   RH   P_v (kPa)   PMV meas   PPD meas




20.3 


030 


1.194 


-12 


30 




21.5 


C3C 


C.^76 


-0.6 


20 




21.3 


CiC 


12T 


-0.67 


16.7 




■i-i ">■> 


CiC 


1354 


-0.72 


14.3 




21.8 


0.70 


1.846 


-C.^5 


8.3 




23 


035 


0593 


^.6 


10 




23 


032 


1.475 


-0.8 


10 




23 


0.76 


2.156 


-0.25 







25 


C3f 


1.109 


-02 







25 


C3C 


13 85 


0.167 







24.7 


0.74 


23 C8 


0.17 







25.7 


034 


1.793 


0.3 







27 


031 


1.116 


0.33 


8,4 




27.1 


033 


1.195 


0.33 


8.4 




27.4 


034 


1S90 


1 


373 




26.4 


0.73 


2334 


0.7 


20 




27.3 


0.74 


2.^12 


1.73 


57.1 




2S.6 


036 


22C5 


1.42 


5S.3 
Problem 5.13 Chiller performance data



Table B 3 Nomenclature similar to the figure for Example 5.4.3

       T_cho (°F)   T_cdi (°F)   Q_ch (tons)   P_comp (kW)   V (gpm)

  1       43           85           378           247         1125
  2       43           80           390           245         1125
  3       43           75           399           241         1125
  4       43           70           406           238         1125
  5       45           85           397           257         1125
  6       45           80           409           254         1125
  7       45           75           417           250         1125
  8       45           70           432           263         1125
  9       50           85           430           267         1125
 10       50           80           445           268         1125
 11       50           75           457           269         1125
 12       50           70           455           259         1125
 13       43           85           340           247          500
 14       43           80           360           249          500
 15       43           75           375           248          500
 16       43           70           387           246          500
 17       45           85           360           258          500
 18       45           80           379           259          500
 19       45           75           394           258          500
 20       45           70           405           255          500
 21       50           85           390           268          500
 22       50           80           407           267          500
 23       50           75           421           266          500
 24       50           70           437           267          500
 25       43           85           368           248          813
 26       43           80           382           247          813
 27       43           75           393           244          813
 28       43           70           401           240          813
 29       45           85           388           258          813
 30       45           80           401           256          813
 31       45           75           411           253          813
 32       45           70           420           249          813
 33       50           85           420           269          813
 34       50           80           435           268          813
 35       50           75           447           267          813
 36       50           70           459           269          813
 37       45           85           397           257         1125
 38       45           82.5         358           225         1125
 39       45           80           318           201         1125
 40       45           77.5         278           178         1125
 41       45           75           238           156         1125
 42       45           72.5         199           135         1125
 43       45           70           159           113         1125
 44       45           67.5         119            91         1125
 45       45           85           360           258          500
 46       45           82.5         324           224          500
 47       45           80           288           199          500
 48       45           77.5         252           175          500
 49       45           75           216           153          500
 50       45           72.5         180           131          500
 51       45           70           144           109          500
 52       45           67.5         108            90          500
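One common way to summarize such chiller data is the coefficient of performance, COP = (cooling load in kW) / (compressor power in kW). The sketch below is an illustration only, assuming the third and fourth data columns are the cooling load in tons and the compressor power in kW, and using the conversion 1 ton of refrigeration ≈ 3.517 kW; the function name and the choice of rows are hypothetical.

```python
KW_PER_TON = 3.517  # 1 ton of refrigeration is approximately 3.517 kW


def chiller_cop(load_tons: float, power_kw: float) -> float:
    """Coefficient of performance from cooling load (tons) and compressor power (kW)."""
    return load_tons * KW_PER_TON / power_kw


# (row number, cooling load in tons, compressor power in kW) from Table B 3.
rows = [(1, 378, 247), (13, 340, 247), (44, 119, 91)]

for number, load, power in rows:
    print(f"row {number:2d}: COP = {chiller_cop(load, power):.2f}")
```

Computing the COP for every row gives a response variable that can then be related to the temperatures, load and flow rate in whatever model the problem calls for.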
Problem 5.14 Effect of tube cleaning in reducing chiller fouling



Table B 4 Nomenclature similar to the figure for Example 5.4.3 

Compressor Chilled Water 



Cooling Water ReMgerant (R-11) 



Date 



Hour 



P 

comp 

(kW) 



cho 

CF) 



cdi 

CF) 



Evap Temp. 
(°F) 



Cond Temp. 

CF) 



CooUng 

load 

(Tons) 



9/11/2000 


0100 


423.7 


45.9 


81.7 


38.2 


97.9 


1290.7 




0200 


417.7 


45.8 


82.0 


38.3 


97.9 


1275.5 




0300 


380.1 


45.8 


81.6 


38.6 


93.6 


1114.9 




0400 


366.0 


45.6 


81.6 


38.6 


96.1 


1040.6 




0500 


374.9 


45.7 


81.6 


38.6 


96.4 


1081.5 




0600 


395.0 


45.7 


81.7 


38.3 


94.1 


1167.9 




0700 


447.9 


45.9 


82.5 


38.1 


99.1 


1353.8 




0800 


503.8 


46.0 


82.5 


37.3 


100.1 


1251.6 




0900 


525.9 


45.9 


81.3 


36.2 


97.2 


816.7 




1000 


490.4 


46.0 


81.3 


37.0 


98.6 


784.0 




1100 


508.7 


46.0 


81.2 


36.8 


98.6 


802.3 




1200 


512.1 


45.9 


81.4 


36.5 


96.9 


825.8 




1300 


515.5 


45.9 


81.4 


36.6 


99.4 


834.2 




1400 


561.4 


45.8 


81.5 


35.5 


100.1 


901.1 




1500 


571.2 


46.0 


81.4 


35.5 


98.6 


922.0 




1600 


567.1 


46.5 


81.4 


35.9 


100.6 


921.5 




1700 


493.4 


48.8 


81.3 


39.4 


94.3 


806.3 




1800 


592.5 


46.2 


81.9 


35.5 


96.7 


955.2 




1900 


560.7 


45.9 


82.2 


35.7 


100.6 


900.8 




2000 


513.7 


45.9 


81.9 


36.4 


99.6 


825.7 




2100 


461.0 


45.9 


81.8 


37.4 


96.1 


734.2 




2200 


439.0 


45.8 


81.6 


37.7 


97.7 


692.7 




2300 


431.7 


45.8 


81.7 


37.8 


97.2 


680.3 




2400 


435.4 


45.9 


81.6 


37.8 


94.6 


676.2 


9/12/2000 


0100 


398.3 


45.7 


82.3 


38.3 


95.9 


993.4 




0200 


371.7 


45.6 


82.5 


38.6 


97.4 


1121.6 




0300 


369.3 


45.5 


82.5 


38.5 


96.7 


1081.1 




0400 


361.5 


45.7 


82.5 


38.6 


94.9 


1056.9 




0500 


374.0 


45.8 


82.6 


38.7 


97.4 


1079.4 




0600 


389.5 


45.7 


82.5 


38.4 


97.2 


1116.0 




0700 


394.4 


45.8 


82.6 


38.4 


95.9 


1122.7 




0800 


425.6 


45.8 


82.6 


38.2 


98.9 


1237.0 




0900 


475.1 


45.9 


83.1 


37.9 


99.6 


1441.1 




1000 


492.9 


46.0 


83.3 


37.7 


99.1 


1522.0 




1100 


531.5 


46.0 


83.4 


37.0 


102.2 


1627.2 




1200 


576.0 


46.2 


82.9 


36.3 


101.8 


1181.6 




1300 


603.6 


46.5 


82.6 


35.7 


101.3 


955.9 




1400 


514.9 


48.3 


83.3 


39.0 


95.1 


1437.8 




1500 


556.7 


45.9 


83.6 


36.1 


99.1 


1696.9 




1600 


548.2 


45.9 


83.6 


36.1 


99.1 


1664.2 




1700 


540.6 


45.9 


83.6 


36.2 


101.3 


1651.6 




1800 


556.8 


45.9 


83.6 


36.2 


101.1 


1682.2 




1900 


589.1 


46.1 


83.6 


35.9 


101.3 


1736.3 




2000 


584.3 


46.1 


83.5 


35.7 


102.9 


1697.4 




2100 


586.8 


46.1 


83.6 


35.8 


103.6 


1657.4 




2200 


572.4 


45.9 


83.5 


35.6 


101.8 


1592.8 




2300 


528.3 


46.0 


83.3 


36.8 


101.5 


1464.5 



Appendix 415 

Table B 4 (continued) 







Compressor 


Chilled Water 


Cooling Water 


Refrigerant (R- 11) 




Cooling 
load 


Date 


Hour 


P 

coinp 

(kW) 


cho 

(°F) 


cdi 

CF) 


Evap Temp. 
(°F) 


Cond Temp. 
(T) 


(Tons) 




2400 


499.9 


46.0 


83.4 


37.4 


101.3 


1416.0 


9/13/2000 


0100 


471.8 


46.0 


83.4 


37.8 


98.6 


1373.2 




0200 


463.9 


46.0 


83.4 


37.9 


99.6 


1365.8 




0300 


469.7 


46.0 


83.3 


37.8 


100.6 


1390.1 




0400 


471.7 


45.8 


83.3 


37.7 


98.4 
99.9 


1390.7 




0500 


464.2 


45.9 


83.3 


37.9 


1386.9 




0600 


463.9 


45.9 


83.3 


37.9 


100.3 


1389.5 




           0700  482.8       46.0           83.3           37.7        98.9      1470.8
           0800  493.8       46.0           83.3           37.5        99.1      1513.6
           0900  511.3       46.0           83.4           37.3        101.8     1554.0
           1000  557.1       46.0           83.5           36.5        102.7     1658.2
           1100  573.2       46.0           83.5           36.1        101.3     1694.9
           1200  573.3       46.0           83.4           36.0        103.4     1681.8
           1300  577.2       46.1           83.4           36.1        103.2     1691.7
           1400  577.3       46.2           83.3           35.9        101.3     1702.1
           1500  580.0       46.4           83.3           36.0        103.4     1718.0
           1600  588.0       46.3           83.4           36.0        103.4     1739.4
           1700  588.9       46.3           83.3           35.7        101.8     1730.7
           1800  574.2       46.0           83.2           35.5        103.2     1684.7
           1900  555.6       46.0           83.0           36.3        102.0     1629.6
           2000  535.6       46.0           82.8           36.5        99.9      1603.6
           2100  497.8       46.0           82.8           37.3        100.8     1499.6
           2200  469.4       45.9           82.7           37.6        99.4      1429.5
           2300  446.1       45.9           82.4           37.9        96.9      1370.2
           2400  426.5       45.8           82.2           38.0        98.6      1306.7
Cleaning
1/17/2001  0100  408.8       44.5           87.0           39.0        94.9      1125.1
           0200  400.6       44.5           86.9           39.2        95.1      1097.9
           0300  397.4       44.4           86.9           39.1        94.9      1070.8
           0400  402.0       44.5           86.8           39.1        95.4      1104.7
           0500  394.1       44.4           86.8           39.1        95.1      1062.5
           0600  398.8       44.5           87.0           39.2        95.6      1093.1
           0700  403.9       44.5           86.8           39.1        95.6      1102.9
           0800  418.2       44.5           86.9           39.0        96.1      1167.6
           0900  430.1       44.6           86.9           39.1        96.7      1201.5
           1000  457.4       44.6           87.0           38.9        97.7      1285.7
           1100  478.6       44.8           87.2           38.9        98.2      1355.1
           1200  499.2       44.8           87.3           38.7        98.9      1411.0
           1300  513.1       44.8           86.9           38.4        98.9      1460.5
           1400  530.1       44.8           87.0           38.1        99.1      1517.8
           1500  525.9       44.8           87.0           38.0        99.1      1492.4
           1600  513.4       44.8           87.0           38.1        98.9      1465.7
           1700  501.2       44.8           87.0           38.1        98.4      1437.0
           1800  493.4       44.8           87.2           38.3        98.4      1405.0
           1900  470.8       45.0           87.2           38.6        97.7      1352.2
           2000  462.6       45.0           87.1           38.7        97.7      1316.7
           2100  460.4       44.9           87.3           38.5        97.7      1301.9
           2200  443.3       44.7           87.2           38.5        97.2      1239.1
           2300  428.3       44.5           87.2           38.4        96.9      1194.8
Table B4 (continued)

Date       Hour  Compressor  Chilled water  Cooling water  Refrigerant (R-11)    Cooling load
                 P_comp      T_cho                         Evap Temp.  Cond Temp.
                 (kW)        (°F)           (°F)           (°F)        (°F)      (Tons)
           2400  414.5       44.5           87.1           38.5        96.7      1140.3
1/18/2001  0100  406.4       44.4           87.1           38.5        96.4      1124.1
           0200  398.6       44.5           87.1           38.7        96.7      1089.6
           0300  402.9       44.4           87.0           38.6        96.4      1100.2
           0400  392.3       44.3           86.8           38.6        96.4      1057.7
           0500  402.3       44.3           87.0           38.5        96.9      1098.4
           0600  391.9       44.3           87.0           38.6        96.7      1053.8
           0700  411.8       44.4           87.0           38.5        97.4      1128.3
           0800  424.9       44.4           86.9           38.4        97.7      1177.8
           0900  442.4       44.5           86.6           38.2        97.9      1249.7
           1000  454.4       44.7           86.5           38.2        98.2      1296.4
           1100  473.0       44.7           86.7           38.1        98.9      1347.5
           1200  477.5       44.8           86.8           38.0        99.1      1354.7
           1300  484.3       44.7           87.0           38.0        99.4      1376.5
           1400  486.4       44.7           86.9           37.8        99.4      1387.4
           1500  487.4       44.7           86.8           37.8        99.4      1381.0
           1600  508.8       44.8           86.9           37.5        100.1     1442.1
           1700  506.7       44.8           86.8           37.4        100.1     1436.5
           1800  509.0       44.8           86.9           37.3        100.1     1443.1
           1900  503.5       44.8           86.9           37.3        99.9      1427.3
           2000  492.3       44.8           87.0           37.5        99.9      1407.8
           2100  477.9       44.7           87.0           37.5        99.6      1376.9
           2200  456.5       44.7           86.8           37.7        98.9      1302.3
           2300  445.7       44.6           86.8           37.8        98.9      1262.6
           2400  432.0       44.5           86.1           37.7        98.4      1240.9
1/19/2001  0100  426.9       44.5           85.7           37.7        98.9      1208.7
           0200  416.4       44.5           85.6           37.8        98.9      1189.5
           0300  416.4       44.5           85.6           37.9        99.1      1200.7
           0400  414.2       44.6           85.6           37.9        99.1      1200.5
           0500  411.6       44.5           85.8           37.9        99.4      1182.7
           0600  421.4       44.5           85.7           37.8        99.6      1209.2
           0700  429.2       44.6           85.8           37.8        100.1     1219.9
           0800  447.3       44.7           86.0           37.7        100.6     1280.4
           0900  481.0       44.7           86.1           37.4        101.3     1371.2
           1000  532.0       44.8           86.5           36.7        102.9     1498.4
           1100  535.1       44.8           86.8           35.5        103.2     1503.0
           1200  529.8       44.9           86.5           36.5        102.9     1491.9
           1300  503.3       44.9           86.5           36.7        102.5     1433.9
           1400  449.4       44.7           85.9           37.4        100.6     1287.2
           1500  517.0       44.9           86.3           36.7        102.5     1468.9
           1600  501.2       44.8           86.4           36.6        102.2     1450.2
           1700  443.6       44.8           86.0           38.1        100.8     1287.0
           1800  368.8       44.6           86.0           39.1        98.6      1045.5
           1900  376.4       44.6           85.9           39.2        98.9      1051.5
           2000  383.7       44.6           85.8           39.1        98.9      1084.7
           2100  390.2       44.6           85.9           39.1        99.6      1110.9
           2200  382.9       44.5           85.8           39.1        99.4      1089.8
           2300  364.5       44.5           85.7           39.2        98.9      999.2
           2400  354.3       44.4           86.1           39.2        99.1      964.0

Problem 9.4 Time series analysis of sunspot frequency per year from 1770-1869

Table B5 Time series analysis of sunspot frequency per year from 1770-1869. (From Montgomery and Johnson 1976 by permission of McGraw-Hill)

Year    n    Year    n    Year    n    Year    n
1770  101    1795   21    1820   16    1845   40
1771   82    1796   16    1821    7    1846   62
1772   66    1797    6    1822    4    1847   98
1773   35    1798    4    1823    2    1848  124
1774   31    1799    7    1824    8    1849   96
1775    7    1800   14    1825   17    1850   66
1776   20    1801   34    1826   36    1851   64
1777   92    1802   45    1827   50    1852   54
1778  154    1803   43    1828   62    1853   39
1779  125    1804   48    1829   67    1854   21
1780   85    1805   42    1830   71    1855    7
1781   68    1806   28    1831   48    1856    4
1782   38    1807   10    1832   28    1857   23
1783   23    1808    8    1833    8    1858   55
1784   10    1809    2    1834   13    1859   94
1785   24    1810         1835   57    1860   96
1786   83    1811    1    1836  122    1861   77
1787  132    1812    5    1837  138    1862   59
1788  131    1813   12    1838  103    1863   44
1789  118    1814   14    1839   86    1864   47
1790   90    1815   35    1840   63    1865   30
1791   67    1816   46    1841   37    1866   16
1792   60    1817   41    1842   24    1867    7
1793   47    1818   30    1843   11    1868   37
1794   41    1819   24    1844   15    1869   74



Problem 9.5 Time series of yearly atmospheric CO2 concentrations from 1979-2005

Table B6 Time series of yearly atmospheric CO2 concentrations from 1979-2005. (From Andrews and Jelley 2007 by permission of Oxford University Press)

Year  CO2 conc.  Temp. diff    Year  CO2 conc.  Temp. diff    Year  CO2 conc.  Temp. diff
1979  336.53     0.06          1988  350.68     0.16          1997  362.98     0.36
1980  338.34     0.1           1989  352.84     0.1           1998  364.9      0.52
1981  339.96     0.13          1990  354.22     0.25          1999  367.87     0.27
1982  341.09     0.12          1991  355.51     0.2           2000  369.22     0.24
1983  342.07     0.19          1992  356.39     0.06          2001  370.44     0.4
1984  344.04     -0.01         1993  356.98     0.11          2002  372.31     0.45
1985  345.1      -0.02         1994  358.19     0.17          2003  374.75     0.45
1986  346.85     0.02          1995  359.82     0.27          2004  376.95     0.44
1987  347.75     0.17          1996  361.82     0.13          2005  378.55     0.47

Problem 9.6 Time series of monthly atmospheric CO2 concentrations from 2002-2006

Table B7 Time series of monthly atmospheric CO2 concentrations from 2002-2006

        2002          2003          2004          2005          2006
Month   CO2 concent.  CO2 concent.  CO2 concent.  CO2 concent.  CO2 concent.
1       372.4         374.7         377.0         378.5         381.2
2       372.8         375.4         377.5         379.0         381.8
3       373.3         375.8         378.0         379.6         382.1
4       373.6         376.2         378.4         380.6         382.5
5       373.6         376.4         378.3         380.7         382.5
6       372.8         375.5         377.4         379.4         381.5
7       371.3         374.0         375.9         377.8         379.9
8       370.2         372.7         374.4         376.5         378.3
9       370.5         373.0         374.3         376.6         378.6
10      371.8         374.3         375.6         377.9         380.6
11      373.2         375.5         377.0         379.4         -
12      374.1         376.4         378.0         380.4         -



Problem 9.8 Transfer function analysis using simulated hourly loads in a commercial building

Table B8 Transfer function analysis using simulated hourly loads in a commercial building

Month  Day  Hour  Tdb (°F)  Twb (°F)  Qint (kW)  Total Power (kW)  Cooling (Btu/h)  Heating (Btu/h)
8      1    1     63        60        594.8      844.6             3,194,081        427,362
8      1    2     63        61        543.4      791.8             3,152,713        445,862
8      1    3     63        61        543.4      789.8             3,085,678        446,532
8      1    4     64        61        543.4      791.7             3,145,654        452,646
8      1    5     64        61        655.8      906.4             3,217,818        434,740
8      1    6     65        63        879.9      1160.3            4,119,484        835,019
8      1    7     68        65        1107.6     1447.1            5,651,144        734,486
8      1    8     72        66        1086.4     1470.7            6,644,373        474,489
8      1    9     75        66        1086.3     1487.3            6,972,290        397,013
8      1    10    78        68        933.2      1392.6            8,161,122        335,897
8      1    11    80        69        934.6      1423.8            8,669,268        306,703
8      1    12    81        69        922.8      1412.4            8,642,472        328,010
8      1    13    81        69        934.8      1426.8            8,688,363        297,011
8      1    14    82        69        934.8      1429.2            8,730,540        287,861
8      1    15    80        67        933.7      1405.4            8,300,878        276,095
8      1    16    82        68        934.7      1434.6            8,877,782        256,967
8      1    17    81        67        787.1      1278.3            8,683,202        345,135
8      1    18    79        66        1153.3     1639.0            8,600,599        402,828
8      1    19    79        66        1358.7     1828.2            8,307,491        446,782
8      1    20    74        65        1448.6     1873.0            7,416,578        519,596
8      1    21    73        65        1345.4     1746.3            6,958,935        551,707
8      1    22    66        62        1104.7     1425.3            5,100,758        643,950
8      1    23    66        63        790.3      1096.2            4,781,506        748,713
8      1    24    65        63        646.3      932.0             4,227,319        365,618
8      2    1     64        61        594.7      856.9             3,551,712        394,635
8      2    2     63        61        543.3      794.8             3,239,986        430,030
8      2    3     64        62        543.5      799.9             3,400,810        443,896
8      2    4     63        60        543.2      785.0             2,924,189        434,391
8      2    5     65        62        656.0      916.0             3,522,530        447,170
8      2    6     65        62        879.6      1169.9            4,388,385        771,258
8      2    7     71        66        1108.6     1503.2            6,925,686        668,155
8      2    8     75        68        1087.7     1533.7            7,897,258        387,154
8      2    9     80        70        1089.8     1603.5            9,130,235        318,321
8      2    10    81        69        935.1      1447.1            9,021,254        287,550
8      2    11    84        70        935.9      1464.8            9,347,346        278,832
8      2    12    84        70        924.0      1450.0            9,264,431        307,150
8      2    13    85        71        936.6      1476.4            9,511,320        275,150
8      2    14    85        71        936.9      1491.1            9,731,617        251,269
8      2    15    85        71        937.4      1505.1            9,956,236        247,798
8      2    16    85        72        938.4      1529.7            10,342,235       247,176
8      2    17    84        71        789.8      1360.3            9,955,993        244,046
8      2    18    81        71        1156.8     1730.2            10,041,017       379,881
8      2    19    79        70        1361.4     1898.3            9,412,865        425,252
8      2    20    76        70        1451.7     1954.6            8,844,201        495,725
8      2    21    74        69        1347.8     1811.4            8,125,985        524,132
8      2    22    74        69        1108.4     1536.9            7,465,724        571,150
8      2    23    73        69        793.4      1192.4            6,868,408        648,661
8      2    24    72        68        648.9      1014.8            6,152,558        304,174
8      3    1     71        68        597.1      944.8             5,755,310        320,848
8      3    2     71        68        545.5      883.7             5,538,507        333,868
8      3    3     70        67        544.9      867.3             5,150,570        339,970
8      3    4     71        66        544.4      858.2             4,947,774        346,538
8      3    5     71        68        657.9      990.4             5,434,792        344,485
8      3    6     73        69        882.7      1262.4            6,514,805        687,770
8      3    7     77        71        1111.3     1560.6            7,936,272        608,746
8      3    8     80        74        1092.5     1644.8            9,760,724        377,956
8      3    9     83        75        1094.1     1705.1            10,594,630       301,946
8      3    10    85        76        941.2      1590.0            11,124,807       270,331
8      3    11    87        77        942.6      1627.3            11,613,918       246,362
8      3    12    87        76        929.6      1591.3            11,207,712       274,067
8      3    13    87        76        941.7      1608.5            11,338,916       237,845
8      3    14    89        77        943.2      1644.6            11,852,497       237,519
8      3    15    89        77        943.3      1651.3            11,887,520       237,746
8      3    16    89        76        942.4      1635.4            11,664,144       236,992
8      3    17    87        76        794.3      1468.9            11,436,989       238,088
8      3    18    87        77        1162.6     1867.1            11,894,025       363,009
8      3    19    85        75        1365.8     2014.9            11,001,879       407,210
8      3    20    85        75        1456.7     2088.4            10,839,194       454,740
8      3    21    79        74        1352.0     1929.0            9,960,708        487,662
8      3    22    78        76        1114.2     1667.6            9,627,610        528,198
8      3    23    78        75        798.1      1306.9            8,801,340        593,055
8      3    24    79        75        654.5      1144.3            8,512,977        280,531
C: Solved Examples and Problems with Practical Relevance

Example 1.3.1: Simulation of a chiller
Problem 1.7 Two pumps in parallel problem viewed from the forward and the inverse perspectives
Problem 1.8 Lake contamination problem viewed from the forward and the inverse perspectives

Example 2.2.6: Generating a probability tree for a residential air-conditioning (AC) system
Example 2.4.3: Using geometric PDF for 50 year design wind problems
Example 2.4.8: Using Poisson PDF for assessing storm frequency
Example 2.4.9: Graphical interpretation of probability using the standard normal table
Example 2.4.11: Using lognormal distributions for pollutant concentrations
Example 2.4.13: Modeling wind distributions using the Weibull distribution
Example 2.5.2: Forward and reverse probability trees for fault detection of equipment
Example 2.5.3: Using the Bayesian approach to enhance value of concrete piles testing
Example 2.5.7: Enhancing historical records of wind velocity using the Bayesian approach
Problem 2.23 Probability models for global horizontal solar radiation
Problem 2.24 Cumulative distribution and utilizability functions for horizontal solar radiation
Problem 2.25 Generating cumulative distribution curves and utilizability curves with measured data

Example 3.3.1: Example of interpolation
Example 3.4.1: Exploratory data analysis of utility bill data
Example 3.6.1: Estimating confidence intervals
Example 3.7.1: Uncertainty in overall heat transfer coefficient
Example 3.7.2: Relative error in Reynolds number of flow in a pipe
Example 3.7.3: Selecting instrumentation during the experimental design phase
Example 3.7.4: Uncertainty in exponential growth models
Example 3.7.5: Temporal propagation of uncertainty in ice storage inventory
Example 3.7.6: Using Monte Carlo to determine uncertainty in exponential growth models
Problem 3.7 Determining cooling coil degradation based on effectiveness
Problem 3.10 Uncertainty in savings from energy conservation retrofits
Problem 3.11 Uncertainty in estimating outdoor air fraction in HVAC systems
Problem 3.12 Sensor placement in HVAC ducts with consideration of flow non-uniformity
Problem 3.13 Uncertainty in estimated proportion of exposed subjects using Monte Carlo method
Problem 3.14 Uncertainty in the estimation of biological dose over time for an individual
Problem 3.15 Propagation of optical and tracking errors in solar concentrators

Example 4.2.1: Evaluating manufacturer quoted lifetime of light bulbs from sample data
Example 4.2.2: Evaluating whether a new lamp bulb has longer burning life than traditional ones
Example 4.2.3: Verifying savings from energy conservation measures in homes
Example 4.2.4: Comparing energy use of two similar buildings based on utility bills - the wrong way
Example 4.2.5: Comparing energy use of two similar buildings based on utility bills - the right way
Example 4.2.8: Hypothesis testing of increased incidence of lung ailments due to radon in homes
Example 4.2.10: Comparing variability in daily productivity of two workers
Example 4.2.11: Ascertaining whether non-code compliance infringements in residences is random or not
Example 4.2.12: Evaluating whether injuries in males and females is independent of circumstance
Example 4.3.1: Comparing mean life of five motor bearings
Example 4.4.1: Comparing mean values of two samples by pairwise and by Hotelling T² procedures
Example 4.5.1: Non-parametric testing of correlation between the sizes of faculty research grants and teaching evaluations
Example 4.5.2: Ascertaining whether oil company researchers and academics differ in their predictions of future atmospheric carbon dioxide levels
Example 4.5.3: Evaluating predictive accuracy of two climate change models from expert elicitation
Example 4.5.4: Evaluating probability distributions of number of employees in three different occupations using a non-parametric test
Example 4.6.1: Comparison of classical and Bayesian confidence intervals
Example 4.6.2: Traditional and Bayesian approaches to determining confidence levels
Example 4.7.1: Determination of random sample size needed to verify peak reduction in residences at preset confidence levels
Example 4.7.2: Example of stratified sampling for variance reduction
Example 4.8.1: Using the bootstrap method for deducing confidence intervals
Example 4.8.2: Using the bootstrap method with a nonparametric test to ascertain correlation of two variables
Problem 4.4 Using classical and Bayesian approaches to verify claimed benefit of gasoline additive
Problem 4.5 Analyzing distribution of radon concentration in homes
Problem 4.7 Using survey sample to determine proportion of population in favor of off-shore wind farms
Problem 4.8 Using ANOVA to evaluate pollutant concentration levels at different times of day
Problem 4.9 Using non-parametric tests to identify the better of two fan models
Problem 4.10 Parametric test to evaluate relative performance of two PV systems from sample data
Problem 4.11 Comparing two instruments using parametric, nonparametric and bootstrap methods
Problem 4.15 Comparison of human comfort correlations between Caucasian and Chinese subjects

Example 5.3.1: Water pollution model between solids reduction and chemical oxygen demand
Example 5.4.1: Part load performance of fans (and pumps)
Example 5.4.3: Beta coefficients for ascertaining importance of driving variables for chiller thermal performance
Example 5.6.1: Example highlighting different characteristic of outliers or residuals versus influence points
Example 5.6.2: Example of variable transformation to remedy improper residual behavior
Example 5.6.3: Example of weighted regression for replicate measurements
Example 5.6.4: Using the Cochrane-Orcutt procedure to remove first-order autocorrelation
Example 5.6.5: Example to illustrate how inclusion of additional regressors can remedy improper model residual behavior
Example 5.7.1: Change point models for building utility bill analysis
Example 5.7.2: Combined modeling of energy use in regular and energy efficient buildings
Example 5.7.3: Proper model identification with multivariate regression models
Section 5.8 Case study example: effect of refrigerant additive on chiller performance
Problem 5.4 Cost of electric power generation versus load factor and cost of coal
Problem 5.5 Modeling of cooling tower performance
Problem 5.6 Steady-state performance testing of solar thermal flat plate collector
Problem 5.7 Dimensionless model for fans or pumps
Problem 5.9 Spline models for solar radiation
Problem 5.10 Modeling variable base degree-days with balance point temperature at a specific location
Problem 5.11 Change point models of utility bills in variable occupancy buildings
Problem 5.12 Determining energy savings from monitoring and verification (M&V) projects
Problem 5.13 Grey-box and black-box models of centrifugal chiller using field data
Problem 5.14 Effect of tube cleaning in reducing chiller fouling

Example 6.2.2: Evaluating performance of four machines while blocking effect of operator dexterity
Example 6.2.3: Evaluating impact of three factors (school, air filter type and season) on breathing complaints
Example 6.3.1: Deducing a prediction model for a 2' factorial design
Example 6.4.1: Optimizing the deposition rate for a tungsten film on silicon wafer
Problem 6.2 Full-factorial design for evaluating three different missile systems
Problem 6.3 Random effects model for worker productivity
Problem 6.6 7? factorial analysis for strength of concrete
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inference, 125 
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binomial, 37, 47, 52, 136, 282, 313 

chi-square, 44, 45, 113, 125, 314, 401 

continuous, 38, 41, 44, 76, 368 

cumulative, 32, 38, 57, 59, 366, 373, 411, 420 

density, 32, 44, 46, 57, 59, 70 

discrete, 33, 35, 38 

exponential, 44, 58, 311, 

F, 37, 46, 113, 117, 120, 402

gamma, 44, 58 

Gaussian, See normal 

geometric, 37, 38, 44, 70 

hypergeometric, 37, 39 

joint, 30, 33, 45, 57, 73, 119 

lognormal, 37, 43, 57, 395 

marginal, 30, 34, 57 

multinomial, 40 
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323, 399 

Poisson, 40, 44, 58, 114, 283, 349, 398

Rayleigh, 45 

sampling, 103, 112, 126, 133, 322, 334 

standard normal, 42, 78, 104, 108, 401 

student t, 43, 84, 118 

uniform, 46, 52, 57, 125 

Weibull, 45, 313, 323
Discriminant analysis, 231, 236, 250 
curves, 17, 316, 349, 362, 382
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Dummy variables, 169 
Durbin-Watson statistic, 166, 412
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main, 186, 201 

random, 62, 185, 203
Effects plot, 117, 137 
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Error sources 

data, 82, 100 

regression, 157, 166 
Error sum of squares 

ANOVA, in, 116, 143

regression, in, 143 
Estimability, concept, 289 

Estimates of parameters, 23, 146, 159, 171, 304, 319, 352 
Estimation methods 

Bayes, 22, 52, 344 

least squares, 22, 141, 158, 362 

maximum likelihood, 23, 234, 277, 289, 308, 310, 314, 320 
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Exploratory data analysis, 19, 23, 71, 73, 96, 172 
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random, 18, 186, 190 
Factor analysis, 119, 307 
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2'' design, 183, 192, 199 
Factor level, 185 
False negatives, 49, 107, 378 
False positives, 49, 108, 280, 378 
Family of distributions, 45 
F distribution 

table, 73, 113 
Factorial experimental designs 

fractional, 191, 198, 201, 332 
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2'' experiment, 183, 192, 199 
Fault detection and diagnosis, 49, 378 
First differencing, 270 
Fitted (predicted) values, 145, 153, 190, 197, 239, 243, 258, 317
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Forecasts 

adaptive, 260 

confidence intervals, 254 

ex ante, 255, 287

ex post, 255, 287
Forward problems, 13, 23 
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Frequency distributions, 103, 285 
Frequentist approach to probability, 28, 127 
F table, 117 
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ANOVA, 122, 246 

regression models, 146, 171 
Full model, 151, 170, 305
Function of variables, 9, 33, 35, 211, 220, 310, 350 
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Hat matrix, 160 
Heteroscedasticity 

detection, 163 

removal, 161, 165 
Histogram, 4, 32, 73, 92, 134, 173, 285, 312, 373 
Hotelling's T² test, 120
Hypergeometric distribution, 39 
Hypothesis testing 

alternative, 107 

ANOVA, 116, 122, 184 

correlation coefficient, for, 127 

distribution, for, 127 

mean, 110, 127 

null, 106, 124 

paired difference, for, 110, 127 

variance, for, 45, 127 
Identification of parameters, 15, 95, 151, 157, 166, 180, 255, 268, 289, 292, 302, 305, 307, 310, 315, 348
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numerical, 289, 351 

structural, 289, 351 
In regression 

single factor (one-way), 116, 119, 122, 125 
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effects, 150, 186, 193, 201 

predictor, 150, 238, 321 

regression, in, 340 

sum of squares, 187 
Intercept of a line, 311 
Interquartile range, 71, 74, 109, 281 
Interval estimates. See Confidence interval 
Jackknife method, 133

Joint density function, 34, 57

Joint probability distribution, 57, 141 
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Kruskal-Wallis test, 125 



Lagrange multiplier method, 212 
Large-sample confidence interval, 112, 134 
Latin square design, 191 
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estimator, 158, 313 

weighted, 23, 158, 162 
Level of significance, 189 
Leverage points in regression, 160 
Likelihood function, 52, 238, 312, 323, 376 
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Linear regression 
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polynomial, 165, 329 
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regression, 289, 314 
Logit function, 315 
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Main effects, 186, 190, 193, 199 
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Marginal distribution, 34, 57 
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Maximum likelihood estimation, 23, 234, 277, 289, 308, 310, 314, 
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distribution of sample, 84 

estimate, 84 

geometric, 70 
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square error, 70, 118, 129, 145, 152, 163, 293, 299 

standard error, 103, 131, 147, 281 
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weighted, 70 
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Mean bias error, 145, 304, 339 
Mean square, 100, 117, 152, 164, 187 
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Mean time, between failures, 253 
Mean value 

continuous variable, 35, 41, 44, 127, 142, 263, 266, 271, 277 

discrete variable, 38, 70, 82, 84, 86, 93, 105, 110, 116, 119, 147, 188, 232, 280
Measurement system 

derived, 64 

errors, 82 

primary, 64, 139 

types, 64 

systems, 61 
Median, 36, 55, 70, 109, 129, 133, 281, 316, 320, 338, 366 
Memoryless property, 40 
Mild outlier, 74, 96 
Mode, 2, 36, 70, 92 
Model 

black-box, 10, 13, 16, 23, 171, 179, 181, 220, 300, 327, 341-343, 
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macroscopic, 157, 166, 327 

mathematical, 1, 24, 141, 149, 199, 207, 340, 363, 377 
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multiple regression, 150, 154, 163 
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reduced, 170, 195, 202, 204 
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Monte Carlo methods 
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during regression, 157 
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detection, 293 

effects on parameter estimation, 156 
Multifactor ANOVA, 185 
Multinomial function, 37, 40 
Multiple sample comparison 

Dunnett's, 406 

Tukey, 118, 137 
Multiple regression 

forward stepwise, 302 

backward stepwise, 249 

stagewise, 289, 302 
Mutually exclusive events, 29 
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Nonlinear regression, 317 

Nonparametric tests 

Kruskal-Wallis, 125

Spearman rank correlation, 124 

Wilcoxon, 123 
Nonrandom sample, 128 
Normal distribution 

bivariate, 115, 120 

probability plot, 75, 190 

standard, 43, 138, 399 

table, 401 
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Objective functions 
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Odds ratio, 315 
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One-way ANOVA, 117 
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Optimization 
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non-linear, 221 

penalty function, 318 
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Bayesian, 22 
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maximum likelihood, 23, 289 
Part-load performance of energy equipment 
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294, 300, 302 
Piece-wise linear regression, 169
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bar, 73, 75 
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dot, 76, 78 

histogram, 73, 75, 173, 312, 373 
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245, 266, 267, 286, 300, 317, 338, 346 
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trimmed, 70 
Plots of model residuals, 165, 190 
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unbiased, 129, 152 
Poisson distribution, 40, 58, 283 
Polynomial distribution, 68 
Polynomial regression, 150, 327 
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one-sample, 106 
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Bayesian, 22, 27, 47 
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Properties of least square estimators, 151 
Proportion 

sample, 112 
p-values, 106, 117, 121, 152, 174, 198, 202 



Quadratic regression, 77, 150 
Quantile, 73, 136 
Quantile plot, 74 
Quartile, 70, 74, 320, 339 

R 

Random 

effects, 62, 187, 188, 189, 203, 204 

errors, 64, 82, 86, 88, 100, 156, 185, 257, 268, 319 

experiment, 19, 27, 116 

factor, 85, 116, 184, 187, 220, 323 

number generator, 93, 128, 272 

randomized block experiment, 189 

sample, 43, 52, 56, 103, 112, 116, 128, 131, 133, 282, 321, 377

variable, 27, 32, 35, 38, 41, 52, 104, 108, 113, 120, 129, 141, 258, 
260, 277, 281, 366, 375 
Randomized complete block design, 189 
Range, 8, 15, 33, 63, 66, 68, 70, 76, 83, 95, 109, 133, 192, 214, 218, 243, 281, 318, 337, 366
Rank of a matrix, 291
Rank correlation coefficient, 122, 407 
Rayleigh distribution, 45 



Reduced model, 170, 195, 202, 204 
Regression 

analysis, 21, 141, 188, 235, 320, 354 

coefficients, 146, 156, 196, 294, 304 

function, 148, 157, 165, 236, 244 

general additive model, 151 

goodness-of-fit, 141, 195 

linear, 23, 142, 150, 151, 154, 156, 235, 261, 298, 303, 315

logistic, 236, 313, 315 

model evaluation, 145 

model selection, 340 

multiple linear (or multivariate), 150, 151, 154, 156, 170, 187, 195, 
198, 204, 235, 241, 245, 289, 298, 304, 307 

nonlinear, 319, 354 

ordinary least squares (or OLS), 22, 142, 144, 146, 148, 151, 156, 
162, 166, 229, 232, 236 

parameters, 81, 146, 151, 254, 311 

polynomial, 149, 150, 164, 170, 328, 342 

prediction, individual response, 148 

prediction, mean response, 147 

quadratic, 77, 150 

residuals. See Residuals 

robust, 86, 158, 289, 319

simple linear (or univariate), 142, 146, 162, 261, 308, 310, 320, 
344 

stagewise, 304, 305 

stepwise, 151, 170, 174, 249, 302, 320 

sum of squares, 291 

through origin, 167 

weighted least squares, 162, 164, 314 
Regressing time series, 260 
Relative frequency, 27, 32, 50, 54, 126 
Reliability, 20, 45, 56, 67, 84, 282, 382
Replication, 128, 185, 187, 189, 193, 201, 204, 321, 346 
Resampling procedures, 103, 122, 132, 148 
Residuals 

analysis or model, 23, 141, 157, 162, 165-167, 176, 179, 241, 289, 308, 310, 317, 319

non-uniform variance, 162, 189 

plots, 165, 189, 261, 300, 317

serially correlated, 158, 262, 165 

standardized or R-statistic, 160 

sum of squares, 143, 170 
Response surface, 23, 183, 192, 199, 332 
Response variable, 12, 19, 117, 141, 144, 149, 154, 168, 183, 192, 201, 234, 240, 294, 298, 302, 307, 313, 343
Ridge regression 

concept, 298 

trace, 298 
Risk analysis 

assessment of risk, 377, 380, 382, 383, 386 

attitudes, 360, 363, 368 

communication of risk, 378 

discretizing probability distributions, 366 

evaluating consequences, 377

events, 360, 362, 364, 368, 382 

hazards, 377, 384, 386 

indifference plots, 392 

management of risk, 377, 380, 383 

probability of occurrence, 363, 366, 377 
Risk modeling, 361 
Risk perception, 381 
Risk regulation, 380 

Robust regression, 86, 158, 289, 318, 319 
Root mean square error, 145 
S
Sample 

mean, 46, 93, 103, 104, 106, 113, 116, 118, 126, 135, 189, 278,
282, 284 

non-random sampling, 128 

random sampling, 43, 52, 56, 103, 104, 108, 112, 113, 116, 117, 
128, 130, 131, 134, 281, 321, 327, 333, 376 

resampling, 103, 122, 132, 148, 291, 320, 321 

size, 4, 52, 70, 85, 104, 108, 116, 126, 128, 130, 131, 132, 281, 
311 

space, 27, 30, 48, 56, 127 

stratified sampling, 110, 128, 132, 190, 333 

survey, 130, 131, 136 

with replacement, 39, 128, 133, 134, 320 

without replacement, 39, 104, 128, 133, 193, 333 
Sampling distribution 

of the difference between two means, 118, 174 

of a sample mean, 103, 104, 106 

of a proportion, 112, 114 
Sampling inspection, 4 
Scaling 

decimal, 72 

min-max, 72, 232 

standard deviation, 72, 232 
Scatter plot. See Plots of data-scatter 
Scatter plot matrix, 78, 81 
Screening design, 192, 199 
Search methods 

blind, 335 

calculus-based, 200, 211, 215, 221 

steepest descent, 200, 215, 317 
Seasonal effects modeling, 191, 255 
Selection of variables, 289 
Sensors 

accuracy, 61 

calibration, 63 

hysteresis, 63 

modeling response, 10 

precision, 62 

resolution, 62 

rise time, 64 

sensitivity, 63 

span, 62 

time constant, 64 
Sensitivity coefficients, 90, 294, 331, 344 
Shewhart charts, 280, 282, 284
Sign test, 45, 122, 124 
Significance level, 84, 106, 108, 117, 123, 148, 152, 

334, 378 
Simple event, 27 

Simple linear regression. See Regression-simple linear 
Simulation models, 1, 9, 12, 220, 327, 331 
Simulation error, 145 
Single-factor ANOVA, 116, 119, 125 
Skewed distributions, 47 
Skewed histograms, 73 
Slope of line, 81, 142, 147, 149, 150, 160, 166, 168, 254, 263, 267, 

282, 290, 309, 320 
Smoothing methods 

arithmetic moving average, 258 

exponential moving average, 258, 259 

for data scatter. See LOWESS 

splines, 69, 177 
Solar thermal collectors 

models, 177 

testing, 4 



Standard deviation 

large sample, 70, 84, 93, 108, 112, 134 

small sample, 84, 105, 131 
Standard error, 103, 131, 147, 281 
Standardized residuals, 160 
Standardized variables, 109, 236, 295, 298 
Standard normal table, 42, 107, 126, 420 
Stationarity in time series, 256 
Statistical control, 279-281, 283 
Statistical significance, 188, 236, 254 
Statistics 

descriptive, 19, 23, 287, 301 

inferential, 19, 23, 45, 103, 128, 142 
State variable representation, 348 
Stepwise regression, 151, 170 
Stochastic time series, 267, 268 
Stratified sampling, 110, 128, 132 
Studentized t-distribution

confidence interval, 118, 400 

critical values, 46, 105, 125, 147, 400 

statistic, 37, 43, 108, 110, 146, 159 

test, 118 
Sum of squares 

block, 184 

effect, 195 

error, 116, 143, 187 

interaction, 187 

main effect, 186

regression, 143 

residual, 187 

table, 117 

total, 117, 143 

treatment, 116, 143 
Supervisory control, 20, 218, 335, 389 



T

t (See student t)
Test of hypothesis 

ANOVA F test, 113, 122, 146, 170 

bootstrap, 320 

chi-squared, 114

correlation coefficient, 71

difference between means, 174

goodness-of-fit, 114, 144, 170, 241

mean, 108 

nonparametric, 103, 122 

normality, 114, 189, 319

one-tailed, 106, 108, 124 

two-tailed, 108, 124, 147 

type I, 50, 107, 121, 280, 378

type II, 50, 108, 280, 378

Wilcoxon rank-sum, 123, 137, 408 
Time series 

analysis, 19, 23, 253, 257, 268, 417 

correlogram, 268, 271, 276 

differencing, 257, 270 

forecasting, 274 

interrupted, 266 

model identification, recommendations, 275 

smoothing. See smoothing methods 

stationarity, 268, 272 

stochastic, 268 
Thermal networks, 7, 24, 224, 225, 277, 356 
Total sum of squares 

ANOVA, 116, 117, 143, 172, 185, 187 

Transfer function models, 268, 275, 276 
Transforming nonlinear models, 218
Treatment 

level, 184, 190 

sum of squares, 116, 143 
Tree diagram, 31, 49, 240, 248, 365, 378, 358, 386 
Trend and seasonal models, 257, 260, 262, 275 
Trimmed mean, 70 
Trimming percentage, 71
Truncation, 290 

Tukey's multiple comparison test, 118 
Two stage estimation, 163 
Two stage experiments, 4, 31, 48
Two-tailed test, 108, 124, 147 

U 

Unbiased estimator, 112, 129, 130, 152 

Uncertainty 

aleatory, 207, 362, 367

analysis, 82, 87, 94, 96, 199, 361 

biased, 82, 84 

epistemic, 19, 207, 332, 360, 362 

in data, 18

in modeling, 21, 389 

in sensors, 10, 20, 62, 98 

overall, 85, 94 

propagation of errors, 86, 90, 94 

random, 37, 72, 83 
Uniform distribution, 46, 52, 125 
Unimodal, 36, 41, 44
Univariate statistics, 120 
Upper quartile, 70 
Utilizability function, 59 



V

Value of perfect information, 374

Variable

continuous. See Continuous variable

dependent. See Dependent variable

discrete, 33, 35, 37, 50

dummy (indicator), 169

explanatory, 163, 244

forcing, 12

independent, 5, 6, 9, 13, 76, 77, 86, 90, 95, 152, 169, 170, 175,
184, 199-202, 211, 257, 277, 292, 294, 350, 389

input, 5, 10, 15, 291, 345

output, 5, 181, 207, 331, 344, 390

predictor, 150, 170, 309

random. See random variable

regressor, 19, 68, 81, 95, 141, 146, 148, 154, 156, 234, 241, 289,
294, 297, 305, 313, 314, 343

response. See response variable

state, 6, 13, 331, 348
Variable selection, 289
Variable transformations, 162, 289, 322
Variance, 35, 43, 70, 109, 113, 122, 129, 132, 145, 147, 151, 154, 156,
158, 171, 232, 236, 283, 297, 298, 308
Variance inflation factors, 295, 301
Variance stabilizing transformations, 162, 289, 315
Venn diagram, 29, 48

W

Weibull distribution, 45, 311

Weighted least squares, 23, 158, 163, 314

Whisker plot diagrams, 74, 76, 109, 173

White noise, 159, 253, 254, 257, 261, 270-272, 276, 277

Wilcoxon rank-sum test, 123

Wilcoxon signed rank test, 124

Within-sample variation, 116

Y

Yates algorithm, 193



Z

Z (standard normal)

confidence interval, 104, 269 
critical value, 104, 107, 109 
table, 42, 107, 126 



